# Fugue

[Fugue](https://github.com/fugue-project/fugue) was just open sourced. It unifies the core interfaces for computing, so that the same code runs on pandas, Spark and Dask without change.

You can write Fugue workflows using the [programming interface](https://fugue-tutorials.readthedocs.io/en/latest/tutorials/dag.html) or [Fugue SQL](https://fugue-tutorials.readthedocs.io/en/latest/tutorials/sql.html), depending on the project's needs. In this notebook, I am going to use Fugue SQL, an enriched SQL language that can also run on different backends.

In this particular case, the analysis code has no visible dependency on Fugue or Spark: you only see native Python and SQL, yet they run on Fugue and Spark.


# Acknowledgement

I found [Manch Hui](https://www.kaggle.com/manchunhui)'s [notebook](https://www.kaggle.com/manchunhui/us-presidential-election-sentiment-analysis) and [Maksym Shkliarevskyi](https://www.kaggle.com/maksymshkliarevskyi)'s [notebook](https://www.kaggle.com/maksymshkliarevskyi/us-election-eda-sentiment-analysis) very helpful. Thank you!


# Objective

We try to:

* do basic data exploration and sentiment analysis using existing popular tools
* demonstrate a way of doing data analytics that is agnostic to both the compute framework and the scale
* show that a SQL mindset is also great for science work, especially when you can chain statements together
* show that native Python + SQL is powerful enough and that Fugue can invisibly orchestrate them

We DO NOT try to:

* get the best quality result, which requires significant effort on data cleaning and fine tuning
* compete with pandas on speed. For this dataset, pandas may be faster, but the Fugue approach can handle significantly larger data with no code change while still performing well, which pandas can't do


# Setup environment

You only need to install the [fuggle](https://github.com/fugue-project/fuggle) package (Fugue for Kaggle users) and call `setup` with a backend (this also enables Fugue SQL syntax highlighting inside cells). In this notebook, we use Spark.

In [None]:
!pip install "fuggle>=0.0.6"

In [None]:
from fuggle import setup, Plot, PlotBar, PlotBarH, PlotLine

setup("spark")

# Merge Data

It will be much simpler if we merge the multiple csv files into one parquet file for the following steps to use. Benefits:
* Reading csv is normally slower than reading parquet
* When saving to parquet, all column data types become explicit, so the following steps don't need to worry about them

Because this Kaggle dataset is designed more for local processing with pandas, we use pandas to convert the data.

**Notice:** the dataset quality is questionable. For example, we first need to conflate `United States` and `United States of America`.

In [None]:
%%time
import pandas as pd


def load_raw_data() -> pd.DataFrame:
    dfs = []
    for hashtag in ["donaldtrump","joebiden"]:
        df = pd.read_csv(f"../input/us-election-2020-tweets/hashtag_{hashtag}.csv", lineterminator='\n')
        df["hashtag"] = hashtag # add hashtag column
        dfs.append(df)
        print(df.shape)
    all_df = pd.concat(dfs)
    for col in ["created_at", "user_join_date", "collected_at"]:
        all_df[col] = pd.to_datetime(all_df[col]).astype('datetime64[us]')
    for col in ["tweet_id","likes","retweet_count","user_id","user_followers_count"]:
        all_df[col] = all_df[col].astype(int)
    all_df["country"] = all_df["country"].replace("United States of America", "United States")
    return all_df.reset_index(drop=True)

df = load_raw_data()
df.to_parquet("/kaggle/working/tweets_raw.parquet")
# here we also save 5% of sample data into another parquet file
df.sample(frac=0.05, replace=False, random_state=1).to_parquet("/kaggle/working/tweets_sample.parquet")


Now let's use Fugue SQL on Spark (using the 4 cores Kaggle provides) to load the raw parquet file and save it into smaller partitions. This helps Spark execution, but there is a trade-off: the more partitions you have, the better the load balance but the higher the overhead. As you can see, the following SQL code is a mix of standard SQL and enriched syntax.

In [None]:
%%fsql
LOAD "/kaggle/working/tweets_raw.parquet"
DROP COLUMNS collected_at
SELECT DISTINCT * WHERE tweet_id>0  # dedup and remove invalid records
SAVE AND USE PREPARTITION 16 OVERWRITE "/kaggle/working/tweets.parquet"
PRINT ROWCOUNT
SELECT country, COUNT(*) AS ct GROUP BY country ORDER BY ct DESC
PRINT ROWS 100
SELECT * WHERE country IS NULL
PRINT

You can see there are many records without location information. **Analytics based on location may not be accurate for this dataset**. But for the purpose of demonstrating Fugue, we assume it's alright.

Now let's get some simple stats from the data. You start to see some standard SQL. With the Spark backend, these statements are executed as Spark SQL.
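To make the partition-count trade-off mentioned earlier more concrete, here is a toy cost model in plain Python. It is only a sketch under made-up assumptions (the row costs, the fixed per-partition overhead of 1 and the greedy scheduler are all hypothetical), not a description of how Spark actually schedules tasks.

```python
import heapq


def partition_costs(row_costs, partitions):
    """Split rows into equal-sized contiguous partitions and total each one.

    Assumes len(row_costs) is divisible by partitions.
    """
    size = len(row_costs) // partitions
    return [sum(row_costs[i * size:(i + 1) * size]) for i in range(partitions)]


def makespan(task_costs, workers, overhead_per_task):
    """Greedy scheduling: each task goes to the worker that frees up first."""
    finish_times = [0] * workers
    heapq.heapify(finish_times)
    for cost in task_costs:
        earliest = heapq.heappop(finish_times)
        heapq.heappush(finish_times, earliest + cost + overhead_per_task)
    return max(finish_times)


# Hypothetical skewed data: the first 32 rows are 4x more expensive than the rest
row_costs = [4] * 32 + [1] * 96

results = {
    n: makespan(partition_costs(row_costs, n), workers=4, overhead_per_task=1)
    for n in (4, 16, 128)
}
print(results)  # {4: 129, 16: 60, 128: 88}
```

In this toy setup, 4 partitions leave one worker with all the expensive rows, 128 partitions pay the overhead 128 times, and 16 partitions land in between with the shortest total time.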

In [None]:
%%fsql
df = LOAD "/kaggle/working/tweets.parquet"

SELECT hashtag, country, COUNT(*) AS ct 
    FROM df
    WHERE country IS NOT NULL
    GROUP BY hashtag, country
    ORDER BY ct DESC
PRINT TITLE "by country"

SELECT hashtag, state, COUNT(*) AS ct 
    FROM df
    WHERE country = "United States"
    GROUP BY hashtag, state
    ORDER BY ct DESC
PRINT TITLE "by state"

SELECT *, DATE_TRUNC("Hour", created_at) AS ts FROM df
SELECT ts, SUM(IF(hashtag="donaldtrump",0,1)) AS biden, SUM(IF(hashtag="donaldtrump",1,0)) AS trump GROUP BY ts
# fuggle provides visualization extensions such as PlotLine; they are simple and quick but may not be beautiful
# You can write your own extension for specific cases and make the charts amazing
OUTPUT USING PlotLine(x="ts", order_by=["ts"], title="hourly tweets", logy=true)
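For readers less used to SQL, the hourly aggregation in the cell above (truncate `created_at` to the hour, then count tweets per hashtag) can be sketched in plain Python. The timestamps below are made up for illustration, and the helper name `hourly_counts` is hypothetical.

```python
from collections import defaultdict
from datetime import datetime


def hourly_counts(tweets):
    """tweets: iterable of (created_at, hashtag) pairs.

    Returns {hour: {"biden": n, "trump": n}} sorted by hour.
    """
    counts = defaultdict(lambda: {"biden": 0, "trump": 0})
    for created_at, hashtag in tweets:
        # Same idea as DATE_TRUNC("Hour", created_at)
        ts = created_at.replace(minute=0, second=0, microsecond=0)
        # Same idea as the IF(hashtag="donaldtrump", ...) expressions
        key = "trump" if hashtag == "donaldtrump" else "biden"
        counts[ts][key] += 1
    return dict(sorted(counts.items()))


# Made-up sample rows
tweets = [
    (datetime(2020, 11, 3, 9, 15), "donaldtrump"),
    (datetime(2020, 11, 3, 9, 59, 30), "joebiden"),
    (datetime(2020, 11, 3, 10, 0), "joebiden"),
    (datetime(2020, 11, 3, 10, 5), "joebiden"),
]
result = hourly_counts(tweets)
print(result)
```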

# Preprocessing (data cleaning)

We need to normalize the tweet text, remove special characters and stop words, and lemmatize the remaining words. For this part, I borrowed the idea from [Manch Hui](https://www.kaggle.com/manchunhui)'s great [notebook](https://www.kaggle.com/manchunhui/us-presidential-election-sentiment-analysis).

So the first question is: how do we do it for a local pandas dataframe? In the following code, `lemmatize` does everything mentioned above for a single piece of text, and `lemmatize_tweet` is a wrapper function that deals with a pandas dataframe input. You can also see that we wrote some assertions as simple unit tests in the notebook.

In [None]:
import re
import nltk
import unicodedata
from typing import Iterable, Dict, Any, List

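def lemmatize(text):
    wnl = nltk.stem.WordNetLemmatizer()
    stopwords = nltk.corpus.stopwords.words('english')
    # strip accents and other non-ASCII characters, then lowercase
    text = (unicodedata.normalize('NFKD', text)
            .encode('ascii', 'ignore')
            .decode('utf-8', 'ignore')
            .lower())
    # drop everything from the first URL to the end of the line, and
    # replace anything that is not a letter, digit or whitespace with a space
    text = re.sub(r'https?.+|[^a-zA-Z0-9\s]',' ',text)
    words = text.split()
    # remove stop words and lemmatize the rest
    return [wnl.lemmatize(word) for word in words if word not in stopwords]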

assert ['trump', 'word', 'c', 'dd', '123'] == lemmatize("#trump:a,wOrds.c      .dd. 123 https://asdasdf, http://asdasdf,")

# schema: *-tweet+words:[str]
def lemmatize_tweet(df:pd.DataFrame) -> pd.DataFrame:
    df["words"] = df["tweet"].apply(lemmatize)
    return df.drop(["tweet"],axis=1)


tdf = pd.DataFrame([["abc def,g http3"]],columns=["tweet"])
lemmatize_tweet(tdf)

Actually, `lemmatize_tweet` is already a [Fugue extension](https://fugue-tutorials.readthedocs.io/en/latest/tutorials/extensions.html). The comment above the function is called a schema hint. It tells Fugue that the output will be the input without the `tweet` column, plus a `words` column whose type is an array of strings.
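To illustrate how a schema hint such as `*-tweet+words:[str]` reads, here is a toy interpreter in plain Python. It is a hypothetical sketch, not Fugue's own parser: it only understands `*` (all input columns), `-name` (drop a column) and `,name:type` or `+name:type` (add a column), and the type names in the example are just illustrative strings.

```python
import re


def apply_schema_hint(input_schema, hint):
    """Return the output schema (name -> type) described by a simple hint.

    input_schema is a dict of column name -> type name, in column order.
    """
    output = {}
    # Each token is optionally preceded by one of the operators , + -
    for op, token in re.findall(r"([,+-]?)([^,+-]+)", hint):
        token = token.strip()
        if token == "*":
            output.update(input_schema)  # keep all input columns
        elif op == "-":
            output.pop(token)  # drop a column (KeyError if it is absent)
        else:
            name, dtype = token.split(":", 1)
            output[name] = dtype  # add a new column
    return output


tweets_schema = {"tweet_id": "long", "tweet": "str"}
lem_schema = apply_schema_hint(tweets_schema, "*-tweet+words:[str]")
print(lem_schema)  # {'tweet_id': 'long', 'words': '[str]'}
```

Because the output schema is derived from the hint alone, it can be known before any data is processed.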

Now let's use it in Fugue SQL to generate another file `tweets_lem.parquet`.

In [None]:
%%fsql
LOAD "/kaggle/working/tweets.parquet"
SELECT tweet_id, FIRST(tweet) AS tweet GROUP BY tweet_id
TRANSFORM USING lemmatize_tweet
SAVE AND USE OVERWRITE "/kaggle/working/tweets_lem.parquet"
PRINT ROWCOUNT
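One way to picture why a function like `lemmatize_tweet` can run unchanged on a distributed backend: it handles every row independently, so applying it to each partition and concatenating the results gives the same rows as applying it to the whole dataset at once. The sketch below shows this with plain Python lists. The helper names `add_word_count` and `run_partitioned` are hypothetical, and this is not how Fugue or Spark is implemented.

```python
def add_word_count(rows):
    """A row-wise transformation: each output row depends only on its input row."""
    return [{**row, "n_words": len(row["text"].split())} for row in rows]


def run_partitioned(rows, fn, num_partitions):
    """Apply fn to each partition separately and concatenate the results."""
    partitions = [rows[i::num_partitions] for i in range(num_partitions)]
    out = []
    for part in partitions:
        out.extend(fn(part))
    return out


# Made-up rows: row i contains i words
rows = [{"id": i, "text": "word " * i} for i in range(1, 8)]

local = add_word_count(rows)
distributed = run_partitioned(rows, add_word_count, num_partitions=3)

# Row order may differ after partitioning, but the set of rows is the same
assert sorted(distributed, key=lambda r: r["id"]) == local
```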

# Compute Sentiment Polarities

Again, the idea of computing polarities is borrowed from [Manch Hui](https://www.kaggle.com/manchunhui)'s [notebook](https://www.kaggle.com/manchunhui/us-presidential-election-sentiment-analysis) and [Maksym Shkliarevskyi](https://www.kaggle.com/maksymshkliarevskyi)'s [notebook](https://www.kaggle.com/maksymshkliarevskyi/us-election-eda-sentiment-analysis). I will compute both VADER and TextBlob scores.

First of all, let's consider the problem with a local pandas dataframe as input:

In [None]:
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from textblob import TextBlob

# input_has: words:[str]
# schema: *,vader_score:double,textblob_score:double-words
def compute_polarities(df:pd.DataFrame) -> pd.DataFrame:
    sid = SentimentIntensityAnalyzer()
    text = df["words"].apply(lambda w: " ".join(w))
    df["vader_score"] = text.apply(lambda t: sid.polarity_scores(t)["compound"])
    df["textblob_score"] = text.apply(lambda t: TextBlob(t).sentiment.polarity)
    return df.drop(["words"],axis=1)

tdf = pd.DataFrame([["i really love it".split(" "),1]], columns=["words","a"])
compute_polarities(tdf)

Looking at the hints above `compute_polarities`, we know two things:

* the input data must contain a `words` column; if not, an error will be raised before execution
* the output will drop `words` and add two score columns

Again, it is just a native Python function like the ones you normally write. You only add some hints as comments, plus a small test after the code. Now let's compute the polarities using Fugue SQL:

In [None]:
%%fsql
TRANSFORM (LOAD "/kaggle/working/tweets_lem.parquet") USING compute_polarities
SAVE OVERWRITE "/kaggle/working/tweets_polarities.parquet"

# Polarity Analysis

The following scripts analyze the polarities from different perspectives.

## Polarity on locations

The following script computes average polarities weighted by likes, grouped by US state and by country.

In [None]:
%%fsql
df = LOAD "/kaggle/working/tweets.parquet"
polarity = LOAD "/kaggle/working/tweets_polarities.parquet"

data = 
    SELECT hashtag, state, country, vader_score, textblob_score, likes
    FROM df INNER JOIN polarity ON df.tweet_id = polarity.tweet_id
    
SELECT 
        state,
        SUM(IF(hashtag="donaldtrump",NULL,vader_score*likes))/SUM(IF(hashtag="donaldtrump",0,likes)) AS biden,
        SUM(IF(hashtag="donaldtrump",vader_score*likes,NULL))/SUM(IF(hashtag="donaldtrump",likes,0)) AS trump
    FROM data
    WHERE country = "United States" AND state != "Guam"
    GROUP BY state
    
OUTPUT USING PlotBarH(x="state", order_by="trump desc", height=2.0)


SELECT 
        country,
        SUM(IF(hashtag="donaldtrump",NULL,vader_score*likes))/SUM(IF(hashtag="donaldtrump",0,likes)) AS biden,
        SUM(IF(hashtag="donaldtrump",vader_score*likes,NULL))/SUM(IF(hashtag="donaldtrump",likes,0)) AS trump,
        COUNT(*) AS ct
    FROM data
    GROUP BY country
    
SELECT * ORDER BY ct DESC LIMIT 10
OUTPUT USING PlotBar(x="country", y=["biden","trump"], order_by="ct desc")
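The `SUM(IF(...))/SUM(IF(...))` expressions in the query above compute, for each hashtag, the mean polarity score weighted by likes. A minimal plain-Python sketch of the same arithmetic, on made-up rows (the helper name `weighted_mean` is hypothetical):

```python
def weighted_mean(rows, hashtag):
    """Mean score for one hashtag, weighted by likes; None if total likes is zero."""
    weighted_sum = 0.0
    total_likes = 0
    for row in rows:
        if row["hashtag"] == hashtag:
            weighted_sum += row["score"] * row["likes"]
            total_likes += row["likes"]
    return weighted_sum / total_likes if total_likes else None


# Made-up sample rows
rows = [
    {"hashtag": "joebiden", "score": 0.5, "likes": 3},
    {"hashtag": "joebiden", "score": -0.5, "likes": 1},
    {"hashtag": "donaldtrump", "score": 0.75, "likes": 4},
    {"hashtag": "donaldtrump", "score": 1.0, "likes": 0},
]

biden = weighted_mean(rows, "joebiden")     # (0.5*3 - 0.5*1) / 4 = 0.25
trump = weighted_mean(rows, "donaldtrump")  # (0.75*4 + 1.0*0) / 4 = 0.75
print(biden, trump)
```

Note that with this weighting, a tweet with zero likes contributes nothing to the average, however strong its score is.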

## Polarity on time

In [None]:
%%fsql
df = SELECT * FROM (LOAD "/kaggle/working/tweets.parquet") WHERE country = "United States"
polarity = LOAD "/kaggle/working/tweets_polarities.parquet"

data = 
    SELECT DATE_TRUNC("Day", created_at) AS ts, hashtag, likes, country, vader_score, textblob_score
    FROM df INNER JOIN polarity ON df.tweet_id = polarity.tweet_id
    
SELECT 
        ts,
        SUM(IF(hashtag="donaldtrump",NULL,textblob_score*likes))/SUM(IF(hashtag="donaldtrump",0,likes)) AS biden,
        SUM(IF(hashtag="donaldtrump",textblob_score*likes,NULL))/SUM(IF(hashtag="donaldtrump",likes,0)) AS trump
    FROM data
    GROUP BY ts
    
OUTPUT USING PlotLine(x="ts", order_by="ts")