<a href="https://colab.research.google.com/github/ralsouza/apache_spark_real_time_analytics/blob/master/notebooks/spark_streaming_twitter/01_spark_streaming_twitter.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# PySpark Setup

In [None]:
!apt-get update

In [None]:
# Install the dependencies
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget -q https://archive.apache.org/dist/spark/spark-2.4.4/spark-2.4.4-bin-hadoop2.7.tgz
!tar xf spark-2.4.4-bin-hadoop2.7.tgz
!pip install -q findspark

In [None]:
# Environment variables
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-2.4.4-bin-hadoop2.7"
 
# Make pyspark importable
import findspark
findspark.init('spark-2.4.4-bin-hadoop2.7')

In [None]:
# Libraries and Context Setup
import pyspark
from pyspark.sql import SparkSession
from pyspark import SparkContext, SparkConf

# Create the Spark configuration
conf = SparkConf().set("spark.ui.port", "4050")

# Create the Spark context
sc = pyspark.SparkContext(conf=conf)

# Instantiate the Spark session
spark = SparkSession.builder.master('local').appName('My-SparkSQL').getOrCreate()

# Create the SQL Context
sqlContext = pyspark.SQLContext(sc)

In [None]:
# Check context
print(sc)

<SparkContext master=local[*] appName=pyspark-shell>


## Other packages for streaming - Twitter

In [None]:
!pip install requests_oauthlib
!pip install twython
!pip install nltk

## Import Modules

In [None]:
from pyspark.streaming import StreamingContext
from requests_oauthlib import OAuth1Session
from operator import add
from time import gmtime, strftime
import requests
import time
import string 
import ast

## Import NLTK modules

In [None]:
import nltk
from nltk.classify import NaiveBayesClassifier
from nltk.sentiment import SentimentAnalyzer
from nltk.corpus import subjectivity
from nltk.corpus import stopwords
from nltk.sentiment.util import *

In [None]:
# Update frequency
BATCH_INTERVAL = 5

In [None]:
# Create the StreamingContext
ssc = StreamingContext(sc, batchDuration=BATCH_INTERVAL)

An essential part of building a sentiment analysis algorithm, as with any data mining algorithm, is having a comprehensive dataset, or "corpus", to learn from, as well as a separate dataset to test it against and verify that it meets the requirements.

This lets you tune the algorithm to infer better (more accurate) natural language features that can be extracted from the text and that contribute to the sentiment classification, instead of relying on a generic approach.

As a working base, we will use a training dataset provided by the University of Michigan for a Kaggle competition: https://inclass.kaggle.com/c/si650winter11.

This dataset contains 1,578,627 classified tweets, and each row is labeled as:
* 1 for positive sentiment
* 0 for negative sentiment
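
To make the labeling concrete, here is a minimal sketch that parses a few rows and counts the labels. The sample rows are hypothetical, and the column order (id, sentiment, source, text) is an assumption inferred from the indices `row[1]` and `row[3]` used in this notebook. It uses Python's `csv` module rather than the notebook's plain `split(",")`, because a tweet that itself contains commas would otherwise be cut short.

```python
import csv
from collections import Counter

# Hypothetical sample rows; the column order (id, sentiment, source, text)
# is an assumption, not taken from the real file.
sample_lines = [
    '1,0,SourceA,is so sad for my friend',
    '2,1,SourceA,"omg, its already 7:30 :O"',
    '3,1,SourceA,I love this movie',
]

# csv.reader honors quoted fields, so the comma inside the second tweet
# does not split the text into extra columns.
rows = list(csv.reader(sample_lines))

# Count how many rows carry each sentiment label ('1' positive, '0' negative)
label_counts = Counter(row[1] for row in rows)

print(label_counts)
print(rows[1][3])
```

Whether the real file quotes its text column has not been checked here, so treat the `csv` parsing as one option to try rather than a required change.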

In [None]:
# Data file path
file_path = '/content/drive/My Drive/Colab Notebooks/08-apache-spark/data/sentimentos.csv'

In [None]:
rdd_sent = sc.textFile(file_path)

In [None]:
# Removing header
header = rdd_sent.take(1)[0]
dataset = rdd_sent.filter(lambda row: row != header)

In [None]:
type(dataset)

pyspark.rdd.PipelinedRDD

In [None]:
# This function splits the columns in each row and returns a tuple of
# (lowercased words with punctuation removed, sentiment label)

def get_row(row):
  row = row.split(",")
  sentiment = row[1]
  tweet = row[3].strip()
  # Build a translation table that deletes every punctuation character
  translator = str.maketrans({key: None for key in string.punctuation})
  tweet = tweet.translate(translator)
  tweet = tweet.split(' ')
  # Use a list here: tuples are immutable and have no append()
  tweet_lower = []
  for word in tweet:
    tweet_lower.append(word.lower())
  return (tweet_lower, sentiment)
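
The punctuation stripping and lowercasing steps used in `get_row` can be tried in isolation, without Spark. The sketch below applies the same `str.maketrans` table to a hypothetical sample tweet; the sample text and the variable names `sample_tweet`, `cleaned` and `tokens` are illustrative only.

```python
import string

# Translation table that maps every ASCII punctuation character to None,
# which makes str.translate delete it.
translator = str.maketrans({key: None for key in string.punctuation})

# Hypothetical tweet text, used only for illustration
sample_tweet = "I LOVE this movie!!! Best, film: ever."

# Remove punctuation, split on single spaces, then lowercase each word
cleaned = sample_tweet.translate(translator)
tokens = [word.lower() for word in cleaned.split(' ')]

print(tokens)
# → ['i', 'love', 'this', 'movie', 'best', 'film', 'ever']
```

Because the split is on a single space, consecutive spaces in a real tweet would produce empty strings in the token list. Calling `split()` with no argument is one way to avoid that if it becomes a problem.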

In [None]:
# Apply the function to each row of the dataset
ds_train = dataset.map(get_row)

In [None]:
# Create a SentimentAnalyzer object
sentiment_analyzer = SentimentAnalyzer()