This notebook builds the DataFrame that serves as the backend for the visualization. For now, it completes all the processing for July.

## Getting and saving the cleaned Twitter data
* We extract the necessary columns from the Twitter .json files by running the code on the cluster and saving the resulting DataFrames as Parquet.
* The code that extracts the data is in cleaned_twitter.py

In [2]:
import os
import sys

spark_path = os.environ["SPARK_PATH"]
os.environ['SPARK_HOME'] = spark_path
os.environ['HADOOP_HOME'] = spark_path

sys.path.append(spark_path + "/bin")
sys.path.append(spark_path + "/python")
sys.path.append(spark_path + "/python/pyspark/")
sys.path.append(spark_path + "/python/lib")
sys.path.append(spark_path + "/python/lib/pyspark.zip")
sys.path.append(spark_path + "/python/lib/py4j-0.9-src.zip")

In [3]:
import pandas as pd
import os
from tqdm import tqdm
from pyspark import SparkConf, SparkContext
import pyspark.sql
from pyspark.sql.types import *
from pyspark.sql import SQLContext
import time

conf = SparkConf().setAppName("ADA-gcl")
sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)
#sqlContext.setConf("spark.sql.parquet.compression.codec", "snappy")


## Load the Parquet data

## DataFrames
* df1: num_followers, num_following, mentions
* df2: canton (from geo_state), gender
* df3: language, sentiment, location, main, date

In [4]:
path_user_info = 'file:///Users/tbfang/Documents/EPFL/ADA/ADA-Project/twitter_data/twitter_user_info_jul.parquet'
df1 = sqlContext.read.parquet(path_user_info)

path_gender = 'file:///Users/tbfang/Documents/EPFL/ADA/ADA-Project/twitter_data/twitter_gender_jul.parquet'
df2 = sqlContext.read.parquet(path_gender)

path_sentiment = 'file:///Users/tbfang/Documents/EPFL/ADA/ADA-Project/twitter_data/twitter_cleaned_jul.parquet'
df3 = sqlContext.read.parquet(path_sentiment)

In [5]:
df1 = df1.toPandas()
df2 = df2.toPandas()
df3 = df3.toPandas()

In [6]:
print(df1.shape[0], df2.shape[0], df3.shape[0])

795026 795026 795026


In [7]:
orig_df = df1.merge(df2,on='id').merge(df3,on='id')

In [8]:
df = orig_df.copy()  # working copy; orig_df stays untouched for later use
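
Since all three frames have the same row count, the merge is expected to be one-to-one on `id`. A minimal sketch of how that assumption could be checked, using toy frames in place of the real data (all values below are made up for illustration). `validate` is a standard `DataFrame.merge` parameter.

```python
import pandas as pd

# Toy stand-ins for df1, df2, df3; the real frames come from the parquet files.
a = pd.DataFrame({"id": ["1", "2", "3"], "num_followers": [66, 10, 5]})
b = pd.DataFrame({"id": ["1", "2", "3"], "gender": ["FEMALE", "MALE", None]})
c = pd.DataFrame({"id": ["1", "2", "3"], "language": ["en", "de", "fr"]})

# validate="one_to_one" raises pandas.errors.MergeError if an id repeats
# on either side, which would otherwise silently multiply rows.
merged = (
    a.merge(b, on="id", validate="one_to_one")
     .merge(c, on="id", validate="one_to_one")
)

print(merged.shape)
```

If the check fails on the real data, the duplicated ids would need to be inspected before any averaging is done.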

In [9]:
orig_df.count()

id               795026
num_followers    795026
num_following    794949
mentions         285388
canton           486388
gender           795012
language         795026
sentiment        700562
location         777684
main             795026
date             795026
dtype: int64

In [10]:
# missing mentions become an empty string, so len() later gives 0
df['mentions'] = df['mentions'].fillna("")

## Analyzing DF
* 28/80, ~35% have info about mentions (chatters)
* 48/80, ~60% have a canton; that is more than 50%, pretty good!
* 70/80, ~87.5% have a sentiment
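
The fractions above can be reproduced from the non-null counts printed by `orig_df.count()`. A minimal sketch, with the counts copied by hand from that output (the `coverage` name is illustrative):

```python
# Non-null counts copied from the orig_df.count() output above.
total = 795026
non_null = {"mentions": 285388, "canton": 486388, "sentiment": 700562}

# Share of tweets that have a value in each column.
coverage = {col: n / total for col, n in non_null.items()}

for col, share in coverage.items():
    print(f"{col}: {share:.1%}")
```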

### DataFrame Transformations
* DONE: convert mentions to a count (num_mentions)
* DONE: parse date into month and time of day (6am to 6pm is day, 6pm to 6am is night)

### Converting date

In [11]:
def month(iso_time):
    time_struct = time.strptime(iso_time, "%Y-%m-%dT%H:%M:%SZ")
    return time_struct.tm_mon

In [12]:
def time_of_day(iso_time):
    time_struct = time.strptime(iso_time, "%Y-%m-%dT%H:%M:%SZ")
    hour = time_struct.tm_hour
    if 6 <= hour < 18:
        return "Day"
    else:
        return "Night"
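
The timestamps end in `Z`, so they are in UTC, and `time_of_day` classifies the UTC hour. If local Swiss time is what matters, the hour can be shifted first. A minimal sketch, assuming a fixed UTC+2 offset (Central European Summer Time, which applies in July). `time_of_day_local` and `CEST` are illustrative names, not part of the notebook.

```python
from datetime import datetime, timedelta, timezone

# Assumption: every tweet falls in summer time (UTC+2), which holds for July.
CEST = timezone(timedelta(hours=2))

def time_of_day_local(iso_time):
    """Classify a UTC ISO timestamp as Day/Night using the local hour."""
    utc_dt = datetime.strptime(iso_time, "%Y-%m-%dT%H:%M:%SZ")
    utc_dt = utc_dt.replace(tzinfo=timezone.utc)
    local_hour = utc_dt.astimezone(CEST).hour
    return "Day" if 6 <= local_hour < 18 else "Night"

print(time_of_day_local("2016-07-01T04:30:00Z"))  # 06:30 local time
print(time_of_day_local("2016-07-01T16:30:00Z"))  # 18:30 local time
```

For months outside summer time a fixed offset would be wrong, and a proper time zone lookup would be needed instead.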

In [13]:
df['num_mentions'] = df.mentions.apply(len)
df['time_of_day'] = df.date.apply(time_of_day)
df['month'] = df.date.apply(month)

In [14]:
df.head()

| | id | num_followers | num_following | mentions | canton | gender | language | sentiment | location | main | date | num_mentions | time_of_day | month |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1467359435565400044 | 66 | 89.0 | [Urs_Buhler] | | FEMALE | en | NEGATIVE | Svizzera | @Urs_Buhler and hear then you deserve so much ... | 2016-07-01T07:46:46Z | 1 | Day | 7 |
| 1 | 1467365796334500043 | 66 | 89.0 | [Urs_Buhler] | | FEMALE | en | POSITIVE | Svizzera | @Urs_Buhler can not be bought, and we must alw... | 2016-07-01T09:35:26Z | 1 | Day | 7 |
| 2 | 1467359036377100186 | 66 | 89.0 | [Urs_Buhler] | | FEMALE | en | POSITIVE | Svizzera | @Urs_Buhler In that, there's always someone wh... | 2016-07-01T07:41:32Z | 1 | Day | 7 |
| 3 | 1467366116249400147 | 66 | 89.0 | [Urs_Buhler] | | FEMALE | en | POSITIVE | Svizzera | @Urs_Buhler Am I wrong? So bravo Urs. Let me g... | 2016-07-01T09:37:31Z | 1 | Day | 7 |
| 4 | 1467365778307300193 | 66 | 89.0 | [Urs_Buhler] | | FEMALE | en | POSITIVE | Svizzera | @Urs_Buhler I consider every person in this wo... | 2016-07-01T09:34:21Z | 1 | Day | 7 |


## Tweet Level

In [15]:
df_author = df.drop(['mentions','location','main','date','num_mentions'],axis=1)
df.drop(['mentions','location','main','date','num_followers','num_following'],axis=1,inplace=True)
df.dropna(inplace=True)
# keeps roughly 48/80 = 60% of the data
# keeps only rows where every remaining column (canton, gender, sentiment, ...) is known

## Cleaning up Cantons

Currently there are 170 distinct "cantons" from geo_state, which need to be cleaned.

In [16]:
len(pd.DataFrame(df['canton'].unique(),columns=['Canton']))

170

## Handling Authors

To get distinct authors only, I assume that rows with the same number of followers, number of followings, and gender belong to the same person.

In [17]:
df_author.shape[0]

795026

In [18]:
df_author.drop_duplicates(['num_followers', 'num_following', 'gender'], inplace=True) # only the authors

We need a notion of sample size; each group has to be large enough to be significant. We also need to remove outliers in num_followers.
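
One possible way to limit the influence of extreme follower counts before averaging is to drop rows above a high percentile. A minimal sketch on made-up numbers; the 95th-percentile cutoff is an arbitrary assumption, not a value chosen in this notebook.

```python
import pandas as pd

# Made-up follower counts with one extreme account.
authors = pd.DataFrame({"num_followers": [66, 120, 80, 95, 150, 2000000]})

# Keep authors at or below the chosen percentile (assumed cutoff: 95%).
cutoff = authors["num_followers"].quantile(0.95)
trimmed = authors[authors["num_followers"] <= cutoff]

print(len(authors), len(trimmed))
```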

In [19]:
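# average only the numeric author columns; newer pandas raises on non-numeric ones
df_author = df_author.groupby(['canton','gender','language','time_of_day','month'], as_index=False)[['num_followers','num_following']].mean()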

In [20]:
# map sentiment labels to numbers so they can be averaged
sentiment_map = {'NEUTRAL': 0, 'POSITIVE': 1, 'NEGATIVE': -1}
df['sentiment'] = df.sentiment.map(sentiment_map)

In [21]:
# average only the numeric tweet-level columns
df_user = df.groupby(['canton','gender','language','time_of_day','month'], as_index=False)[['sentiment','num_mentions']].mean()

In [22]:
df_final = df_user.merge(df_author,on=['canton','gender','language','time_of_day','month'])

In [23]:
df_final.shape[0]

778

### Evaluating the goodness of the conditions
We want to make sure there are enough tweets in each category, and we want to minimize the number of groups whose sentiment is 0.

In [24]:
df_final[df_final['sentiment'] == 0].shape[0] # number of groups whose mean sentiment is exactly 0

366
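
A mean sentiment of exactly 0 can come from a group with very few tweets, so it helps to look at group sizes alongside the means. A minimal sketch on toy data (all values invented); `min_tweets` is a hypothetical threshold.

```python
import pandas as pd

# Toy tweet-level frame with sentiment already mapped to -1/0/1.
tweets = pd.DataFrame({
    "canton":    ["Aargau", "Aargau", "Aargau", "Bern", "Zurich", "Zurich"],
    "language":  ["de", "de", "de", "fr", "en", "en"],
    "sentiment": [1, 0, -1, 0, 1, 1],
})

keys = ["canton", "language"]
# Mean sentiment and number of tweets per group, side by side.
summary = tweets.groupby(keys)["sentiment"].agg(["mean", "count"]).reset_index()

min_tweets = 2  # hypothetical minimum sample size
reliable = summary[summary["count"] >= min_tweets]

print(reliable)
```

Groups below the threshold could then be dropped or merged into a coarser category before they reach the visualization.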

## Final DF

* canton, gender, language, time_of_day, and month are the columns we want to filter by
* sentiment, num_mentions, num_followers, and num_following are all averages for those categories. num_followers and num_following are averaged over the unique tweet authors only. sentiment and num_mentions are averaged over all tweets matching those filters.

### Future work
* The cantons in this DataFrame need to be reduced
* Then, we also want to make sure there are enough tweets in each category

Still left to do:
* map geo_state to cantons; currently there are 170 distinct "cantons", and we need to use the Google Maps API to map them to the 26 cantons.

Plan:
* get all geo_states for July to October, then use Google Maps to build a mapping from geo_states to cantons.
* for January to June, map location to geo_states and then to cantons.
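
Once a geo_state-to-canton table exists (however it is obtained), applying it is a dictionary lookup. A minimal sketch; the entries in `geo_state_to_canton` are hypothetical examples, not the output of any geocoding service.

```python
# Hypothetical lookup table; the real one would cover all 170 raw values.
geo_state_to_canton = {
    "Zurich": "Zurich",
    "Canton of Zurich": "Zurich",
    "Kanton Bern": "Bern",
}

def to_canton(geo_state):
    """Return the canonical canton, or None when the value is unmapped."""
    if geo_state is None:
        return None
    return geo_state_to_canton.get(geo_state.strip())

raw_values = ["Zurich", " Kanton Bern ", "Lombardy", None]
print([to_canton(v) for v in raw_values])
```

Returning None for unmapped values keeps them visible as missing data, so they can be dropped or reviewed rather than silently assigned to a wrong canton.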

Spark/cluster work:
* the gender data for January needs to be fixed
* the gender data for August and September is missing

In [25]:
df_final

| | canton | gender | language | time_of_day | month | sentiment | num_mentions | num_followers | num_following |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Aargau | FEMALE | de | Day | 7 | 0.064516 | 0.917051 | 807.875000 | 356.625000 |
| 1 | Aargau | FEMALE | de | Night | 7 | 0.031250 | 0.750000 | 482.571429 | 375.000000 |
| 2 | Aargau | FEMALE | en | Day | 7 | 0.228330 | 0.463002 | 1598.235294 | 394.882353 |
| 3 | Aargau | FEMALE | en | Night | 7 | 0.116022 | 0.638122 | 659.629630 | 603.148148 |
| 4 | Aargau | FEMALE | es | Day | 7 | 0.000000 | 0.538462 | 156.500000 | 81.500000 |
| 5 | Aargau | FEMALE | es | Night | 7 | 0.000000 | 0.510791 | 1293.333333 | 530.666667 |
| 6 | Aargau | FEMALE | et | Day | 7 | 0.000000 | 0.000000 | 1960.000000 | 599.000000 |
| 7 | Aargau | FEMALE | fr | Night | 7 | 0.000000 | 1.200000 | 1958.000000 | 599.000000 |
| 8 | Aargau | FEMALE | in | Day | 7 | 0.000000 | 0.000000 | 1795.500000 | 472.000000 |
| 9 | Aargau | FEMALE | in | Night | 7 | 0.000000 | 0.000000 | 308.000000 | 271.000000 |


## Creating the Canton Map

In [28]:
canton_mapping = orig_df[['canton','location']].dropna().drop_duplicates()

In [30]:
canton_mapping.head(10)

| | canton | location |
|---|---|---|
| 56 | Zurich | Sissach |
| 57 | Solothurn | Suïssa |
| 58 | Valais | Conthey |
| 59 | Schaffhausen | Schaffhausen |
| 64 | Thurgau | Winterthur |
| 68 | Bern | Biel/Bienne |
| 70 | Bern | Thun |
| 71 | Zurich | Winterthur |
| 72 | Zurich | Baden |
| 74 | Lucerne | Luzern |
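
The sample shows that one `location` can appear under more than one canton (Winterthur is listed under both Thurgau and Zurich). One way to resolve such conflicts is a majority vote over the tweets. A minimal sketch with invented pairs; this is only an illustration, not a method used in this notebook.

```python
from collections import Counter, defaultdict

# Invented (canton, location) pairs, one per tweet.
pairs = [
    ("Zurich", "Winterthur"),
    ("Zurich", "Winterthur"),
    ("Thurgau", "Winterthur"),
    ("Bern", "Thun"),
]

# Count how often each canton is seen for each location.
votes = defaultdict(Counter)
for canton, location in pairs:
    votes[location][canton] += 1

# Keep the most frequent canton per location.
location_to_canton = {
    loc: counts.most_common(1)[0][0] for loc, counts in votes.items()
}
print(location_to_canton)
```

On the real data the vote would have to run before `drop_duplicates()`, since that call discards the per-pair counts.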


## Find Sentiment Patterns

In [None]:
sent_map = {'NEUTRAL': 0, 'POSITIVE': 1, 'NEGATIVE': -1}
# replace() leaves values that are already numeric unchanged
df['Sentiment_Val'] = df['sentiment'].replace(sent_map)

# mean sentiment per language, as a two-column DataFrame
df_viz = df.groupby('language')['Sentiment_Val'].mean().reset_index()
df_viz.columns = ['language', 'Sentiment_Val']

In [None]:
df_viz

If you want, you can save the full df as one big Parquet file, upload it to your cluster home, then load it there and work with pandas.

In [None]:
# of followers, # of followings, # of retweets

In [None]:
# 'main' was dropped from df above, so read the tweet text from orig_df
for text in orig_df['main'].head(1000):
    print(text)