# Advanced Certification Program in Computational Data Science
## A program by IISc and TalentSprint
### Mini-Project: End-to-end analytics application using PySpark

## Problem Statement

Perform sentiment classification on airline tweet data with PySpark.

## Learning Objectives

At the end of the mini-project, you will be able to:

* analyze text data using PySpark
* derive insights from the data and visualize them
* implement feature extraction and classify the data
* train a classification model and deploy it

### Dataset

The dataset chosen for this mini-project is **[Twitter US Airline Sentiment](https://data.world/socialmediadata/twitter-us-airline-sentiment)**, a record of tweets about US airlines scraped from Twitter in February 2015. Contributors were asked to first classify each tweet as positive, negative, or neutral, and then to categorize the reason for each negative tweet (such as "late flight" or "rude service"). Along with other information, the dataset contains the tweet ID, the sentiment of the tweet (neutral, negative, or positive), the reason for a negative tweet, the airline name, and the tweet text.

## Information

The airline industry is a very competitive market that has grown rapidly over the past two decades. Airlines have traditionally relied on customer feedback forms, which are tedious and time-consuming to collect and process. Twitter is a good alternative source of customer feedback on which sentiment analysis can be performed. This dataset comprises tweets about 6 major US airlines, and multi-class classification can be performed to categorize the sentiment (neutral, negative, positive). In this mini-project we start with pre-processing techniques to clean the tweets and then represent the tweets as vectors. A classification algorithm is then used to predict the sentiment of unseen tweets. The end-to-end analytics is performed using PySpark.

## Grading = 10 Points

#### Install Pyspark

In [None]:
#@title Install packages and download the dataset
!pip -qq install pyspark
!pip -qq install handyspark
!wget -qq https://cdn.iisc.talentsprint.com/CDS/MiniProjects/US_Airline_Tweets.csv
print("Packages installed successfully and dataset downloaded!!")

#### Import required packages

In [None]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import *
from handyspark import *
#import seaborn as sns
from matplotlib import pyplot as plt
import re
import string
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.feature import CountVectorizer
from pyspark.ml.classification import NaiveBayes
from pyspark.sql.types import ArrayType, StringType

In [None]:
# NLTK imports
import nltk
nltk.download('punkt')
# Download stopwords
nltk.download('stopwords')
nltk.download('wordnet')
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

### Data Loading

#### Start a Spark Session

A Spark session is the unified entry point of a Spark application, introduced in Spark 2.0. It provides a way to interact with Spark's various functionalities through fewer constructs.

In [None]:
spark = SparkSession.builder.appName('TwitterSentiment').getOrCreate()

#### Load the data and infer the schema

To load the dataset, use `spark.read.csv` with the `inferSchema` and `header` parameters.

In [None]:
dataset = spark.read.csv("/content/US_Airline_Tweets.csv",inferSchema=True,header=True)

In [None]:
dataset.show(1)

In [None]:
dataset.count()

### EDA & Visualization (2 points)

#### Visualize the horizontal barplot of airline_sentiment (positive, negative, neutral)

Convert the data to a HandySpark dataframe, keep only the records whose `airline_sentiment` is one of the three values mentioned above, and plot the graph.

In [None]:
handyDf = dataset.toHandy()

handyDf = handyDf.filter(handyDf["airline_sentiment"].isin(["neutral","positive","negative"]))
sentiments = handyDf.cols['airline_sentiment'][:].value_counts()
sentiments.plot.barh()

In [None]:
handyDf.count()

#### Plot the number of tweets received for each airline

In [None]:
handyDf.cols['airline'][:].value_counts().plot.bar()

#### Visualize a stacked barchart of 6 US airlines and 3 sentiments on each bar

* Display the count corresponding to each sentiment in each bar. [hint](https://priteshbgohil.medium.com/stacked-bar-chart-in-python-ddc0781f7d5f)

In [None]:
# count tweets per (airline, sentiment) and pivot in pandas so that every
# sentiment series is aligned on the same airline order
grouped = handyDf.groupby('airline','airline_sentiment').agg(count('airline_sentiment').alias('n')).toPandas()
counts = grouped.pivot(index='airline', columns='airline_sentiment', values='n').fillna(0)
neutral, positive, negative = counts['neutral'], counts['positive'], counts['negative']

In [None]:
airlines = list(counts.index)
plt.figure(figsize=(10,7))

ax1 = plt.bar(airlines,neutral.values,color='y',label='neutral')
ax2 = plt.bar(airlines,positive.values,bottom=neutral.values,color='b',label='positive')
ax3 = plt.bar(airlines,negative.values,bottom=neutral.values+positive.values,color='r',label='negative')
for r1, r2, r3 in zip(ax1, ax2, ax3):
    h1 = r1.get_height()
    h2 = r2.get_height()
    h3 = r3.get_height()
    plt.text(r1.get_x() + r1.get_width() / 2., h1 / 2., "%d" % h1, ha="center", va="center", color="white", fontsize=16, fontweight="bold")
    plt.text(r2.get_x() + r2.get_width() / 2., h1 + h2 / 2., "%d" % h2, ha="center", va="center", color="white", fontsize=16, fontweight="bold")
    plt.text(r3.get_x() + r3.get_width() / 2., h1 + h2 + h3 / 2., "%d" % h3, ha="center", va="center", color="white", fontsize=16, fontweight="bold")
plt.legend()
plt.show()
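
The stacking logic behind this chart can be sketched without Spark or Matplotlib: each sentiment segment starts where the previous one ended, so the `bottom` argument is a running total per airline. The airline names and counts below are made-up placeholders, not figures from the dataset, and `stack_offsets` is a hypothetical helper.

```python
# Hypothetical counts per airline (placeholder numbers, not from the dataset).
counts = {
    "AirlineA": {"neutral": 3, "positive": 2, "negative": 5},
    "AirlineB": {"neutral": 1, "positive": 4, "negative": 2},
}
order = ["neutral", "positive", "negative"]  # bottom-to-top stacking order


def stack_offsets(counts, order):
    """Return {sentiment: [(airline, bottom, height), ...]} for a stacked bar chart."""
    running = {airline: 0 for airline in counts}  # current top of each bar
    layers = {}
    for sentiment in order:
        layer = []
        for airline in sorted(counts):  # fixed airline order keeps bars aligned
            height = counts[airline].get(sentiment, 0)
            layer.append((airline, running[airline], height))
            running[airline] += height
        layers[sentiment] = layer
    return layers


layers = stack_offsets(counts, order)
for sentiment, layer in layers.items():
    print(sentiment, layer)
```

Iterating over the airlines in one fixed order for every sentiment is what keeps each label attached to the right bar.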

#### Visualize the horizontal barplot of negative reasons

In [None]:
handyDf1 = handyDf.filter(~handyDf["negativereason"].isNull())
negativereason = handyDf1.cols['negativereason'][:].value_counts()
negativereason.plot.barh()

### Pre-processing (3 points)

#### Check the null values and drop the records where the text value is null

In [None]:
# check the null values in text column
dataset.filter(dataset.text.isNull()).count()

In [None]:
# filter out and remove null values from text column
dataset_filtered = dataset.filter(dataset.text.isNotNull())

In [None]:
# verify the null count
dataset.count(), dataset_filtered.count()

#### Fill the null values with 0 in all the columns except the target

The target should not be empty. Ensure that all features are integer type, convert if needed.

In [None]:
# replace nulls (and unparsable values) with 0 and cast to integer using built-in column functions
for c in ['negativereason_confidence', 'airline_sentiment_confidence']:
    dataset_filtered = dataset_filtered.withColumn(c, coalesce(col(c).cast('double'), lit(0.0)).cast('int'))

#### Preprocessing and cleaning the tweets

* Convert the text to lower case
* Remove usernames, hashtags and links from the text (tweets)

In [None]:
def words_process(text):
    text = text.lower() # lowercase
    text = re.sub(r'https?://[^\s<>"]+|www\.[^\s<>"]+', '', text) # remove hyperlinks
    text = re.sub(r'\d+', '', text) # remove numbers
    text = " ".join([i for i in text.split() if i.find("@")== -1]) # remove usernames
    text = text.replace('#','') # drop the '#' symbol; the hashtag word is kept
    text = re.sub(r'[^a-zA-Z0-9 ]',r'',text) # remove remaining punctuation
    return text

words = udf(words_process)
dataset_filtered = dataset_filtered.withColumn('text_processed',words(dataset_filtered['text']))
dataset_filtered.show()
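
The cleaning steps can also be tried on a plain Python string before wrapping them in a `udf`. The sketch below is a stand-alone variant, not the notebook's exact function: `clean_tweet` and the sample tweet are made up for illustration, and it replaces removed pieces with spaces before collapsing whitespace.

```python
import re


def clean_tweet(text):
    """Lowercase a tweet and strip links, @mentions, digits and punctuation."""
    text = text.lower()
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)  # links
    text = re.sub(r"@\w+", " ", text)                    # @usernames
    text = re.sub(r"\d+", " ", text)                     # digits
    text = text.replace("#", "")                         # keep the hashtag word, drop '#'
    text = re.sub(r"[^a-z ]", " ", text)                 # remaining punctuation
    return " ".join(text.split())                        # collapse whitespace


# Made-up sample tweet, used only to exercise the function.
sample = "@VirginAmerica Flight 236 was GREAT! #happy http://t.co/abc123"
print(clean_tweet(sample))
```

Note that `str.replace` does not interpret regular expressions and returns a new string, so pattern-based removal needs `re.sub` and the result has to be assigned back.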

#### Tokenize the processed text into words using the NLTK word tokenizer

In [None]:
word_udf = udf(lambda x: word_tokenize(x), ArrayType(StringType()))
dataset_filtered = dataset_filtered.withColumn("wordss", word_udf("text_processed"))
dataset_filtered.show(5)

#### Remove the stopwords from tokenized words

In [None]:
stop_words = set(stopwords.words('english'))
print(stop_words)

In [None]:
# declare the array return type so the column stays array<string>
punct_udf1 = udf(lambda x: [w for w in x if w not in stop_words], ArrayType(StringType()))
dataset_filtered = dataset_filtered.withColumn("wordss", punct_udf1("wordss"))

#### Apply Lemmatization to the words

In [None]:
def lemmatize(text_arr):
    lemmatizer = WordNetLemmatizer()
    return [lemmatizer.lemmatize(word) for word in text_arr]
# declare the array return type so the column stays array<string>
lem = udf(lemmatize, ArrayType(StringType()))
dataset_filtered = dataset_filtered.withColumn("wordss", lem("wordss"))

In [None]:
dataset_filtered.show(5)

### Feature Extraction (3 points)

Create useful features from the text column to train the model.

For example:
* Length of the tweet
* No. of hashtags in the tweet starting with '#'
* No. of mentions in the tweet starting with '@'

Hint: create a new column for each of the above features

* create a column "Length of tweet" using `udf` function

In [None]:
# tweet length in characters, computed on the cleaned text
length = udf(lambda x: len(x))

dataset_filtered = dataset_filtered.withColumn('tweet_length', length(dataset_filtered['text_processed']).astype('int'))
dataset_filtered.show()

* Create a new column "No.of Hashtags" in each tweet

In [None]:
num_hashtags = udf(lambda x : len(re.compile(r"#(\w+)").findall(x)))

dataset_filtered = dataset_filtered.withColumn('num_hashtags', num_hashtags(dataset_filtered['text']).astype('int'))
dataset_filtered.show()

* Create a new column "No.of Mentions" in each tweet

In [None]:
num_mentions = udf(lambda x : len(re.compile(r"@(\w+)").findall(x)))

dataset_filtered = dataset_filtered.withColumn('num_mentions', num_mentions(dataset_filtered['text']).astype('int'))
dataset_filtered.show()
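
The three counting features above can be checked on a plain string first. This is a minimal sketch in pure Python; `tweet_counts` and the sample tweet are hypothetical and only mirror the regular expressions used in the cells above.

```python
import re

HASHTAG = re.compile(r"#\w+")   # '#' followed by word characters
MENTION = re.compile(r"@\w+")   # '@' followed by word characters


def tweet_counts(text):
    """Return length, hashtag count and mention count for one raw tweet."""
    return {
        "tweet_length": len(text),
        "num_hashtags": len(HASHTAG.findall(text)),
        "num_mentions": len(MENTION.findall(text)),
    }


# Made-up sample tweet.
sample = "@united thanks for the upgrade! #grateful #travel"
print(tweet_counts(sample))
```

Hashtags and mentions have to be counted on the raw text, because the cleaning step removes the `#` and `@` markers.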

* Create a new column "Punctuation Count" in each tweet

In [None]:
def punctCount1(text):
    # count punctuation characters; the set leaves out '#', '@' and '"'
    puncts = "!$%&'()*+,-./:;<=>?[\\]^_`{|}~"
    return sum(1 for ch in text if ch in puncts)

punctCount = udf(punctCount1)
dataset_filtered = dataset_filtered.withColumn('PunctCount',punctCount(dataset_filtered['text']).astype('int'))
dataset_filtered.select('PunctCount').show()

* Create a new column "Type of Punctuations" holding the number of distinct punctuation characters used in each tweet

In [None]:
def types_punctuation(text):
    # number of distinct punctuation characters in the tweet
    return len(set([i for i in text if i in "!$%&'()*+,-./:;<=>?[\\]^_`{|}~"]))

typePunct = udf(types_punctuation)
dataset_filtered = dataset_filtered.withColumn('typePunct',typePunct(dataset_filtered['text']).astype('int'))
dataset_filtered.show(10)
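
Both punctuation features can be derived in one pass. The sketch below builds the character set from `string.punctuation` instead of a hand-typed literal; `punct_features` is a hypothetical helper and the sample text is made up.

```python
import string

# string.punctuation minus '"', '#' and '@' gives the same characters
# as the literal used in the cells above.
PUNCT = set(string.punctuation) - set('"#@')


def punct_features(text):
    """Return (total punctuation count, number of distinct punctuation characters)."""
    found = [ch for ch in text if ch in PUNCT]
    return len(found), len(set(found))


total, distinct = punct_features("Late again!!! Really?? #fail")
print(total, distinct)
```

Deriving the set from the standard library avoids the backslash-escaping pitfalls of typing the punctuation string by hand.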

#### Get the features by applying CountVectorizer
CountVectorizer converts the list of tokens to vectors of token counts. See the [documentation](https://spark.apache.org/docs/latest/ml-features.html#countvectorizer) for details.

In [None]:
# named count_vectorizer so it does not shadow pyspark.sql.functions.count
count_vectorizer = CountVectorizer(inputCol="wordss", outputCol="rawFeatures")
count_model = count_vectorizer.fit(dataset_filtered)
featurizedData = count_model.transform(dataset_filtered)
featurizedData.show(truncate=False)
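
To see what a count vectorizer produces, here is a small pure-Python sketch of the idea. It is a simplification, not Spark's implementation: the toy documents are made up, ties in the vocabulary order are broken alphabetically (an assumption of this sketch), and the vectors are dense lists, whereas Spark returns sparse vectors and offers options such as `vocabSize` and `minDF`.

```python
from collections import Counter

# Toy tokenized documents (made up for illustration).
docs = [
    ["flight", "late", "late"],
    ["great", "flight"],
    ["late", "bag", "lost"],
]

# Vocabulary ordered by corpus frequency, most frequent first.
corpus_counts = Counter(token for doc in docs for token in doc)
vocab = sorted(corpus_counts, key=lambda t: (-corpus_counts[t], t))
index = {token: i for i, token in enumerate(vocab)}


def vectorize(tokens):
    """Return a dense count vector aligned with `vocab`."""
    vec = [0] * len(vocab)
    for token, n in Counter(tokens).items():
        if token in index:  # tokens outside the vocabulary are ignored
            vec[index[token]] = n
    return vec


print(vocab)
for doc in docs:
    print(doc, vectorize(doc))
```

The same fitted vocabulary must be reused for new tweets, which is why the deployment code transforms user input with the already fitted model rather than fitting a new one.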

#### Encode the labels

Using the `udf` function, encode the string values of *airline_sentiment* as integers.

In [None]:
def LabelEncoder(x):
    if x == 'positive':
        return 0
    elif x == 'negative':
        return 1
    return 2
encoder = udf(LabelEncoder)
featurizedData = featurizedData.withColumn('label', encoder(featurizedData['airline_sentiment']).astype('int'))
featurizedData.show()
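
The encoding can also be written as a lookup table, which makes the reverse mapping easy and makes unexpected values visible. This is a sketch of an alternative, not the cell above: `encode_label` and `decode_label` are hypothetical helpers, and unlike the `LabelEncoder` function they return `None` for unknown values instead of treating them as neutral.

```python
# Same integer codes as in the LabelEncoder function above.
LABELS = {"positive": 0, "negative": 1, "neutral": 2}
NAMES = {code: name for name, code in LABELS.items()}


def encode_label(sentiment):
    """Map a sentiment string to its integer code; None for unexpected values."""
    return LABELS.get(sentiment)


def decode_label(code):
    """Map an integer code back to the sentiment string."""
    return NAMES.get(code)


print(encode_label("negative"), decode_label(2), encode_label("not a sentiment"))
```

Rows that encode to `None` could then be filtered out before training, if malformed records are present in the sentiment column.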

### Train the classifier and evaluate (1 point)

Create vector assembler with the selected features to train the model

In [None]:
featureassembler = VectorAssembler(inputCols=['rawFeatures','airline_sentiment_confidence','negativereason_confidence',
                                              'tweet_length','num_hashtags','num_mentions','retweet_count','PunctCount','typePunct']
                                   ,outputCol='features')
features = featureassembler.transform(featurizedData)
features.select('features').show(truncate=False)

#### Arrange features and label and split them into train and test.

In [None]:
featuresAndLabels = features.withColumn('labels',featurizedData['label'])
final = featuresAndLabels.select('features','labels')
final.show()

In [None]:
train_data,test_data = final.randomSplit([0.75,0.25])

#### Train the model with train data and make predictions on the test data

To classify the text data, implement a Naive Bayes classifier, which is a probabilistic machine learning model.

For more information about **NaiveBayes Classifier**, click [here](https://spark.apache.org/docs/latest/ml-classification-regression.html#naive-bayes)
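
The idea behind the classifier can be sketched in pure Python: score each class by its log prior plus the smoothed log likelihood of every token, and pick the highest score. This is a minimal multinomial Naive Bayes with additive smoothing on made-up toy data; `train_nb` and `predict_nb` are hypothetical helpers, not Spark's implementation (Spark's `NaiveBayes` defaults to the multinomial model with `smoothing=1.0`).

```python
import math
from collections import Counter, defaultdict


def train_nb(docs, labels, alpha=1.0):
    """Fit multinomial Naive Bayes with additive (Laplace) smoothing."""
    vocab = {t for doc in docs for t in doc}
    class_docs = Counter(labels)
    token_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        token_counts[label].update(doc)
    model = {}
    for label, n_docs in class_docs.items():
        denom = sum(token_counts[label].values()) + alpha * len(vocab)
        model[label] = {
            "log_prior": math.log(n_docs / len(docs)),
            "log_lik": {t: math.log((token_counts[label][t] + alpha) / denom)
                        for t in vocab},
        }
    return model


def predict_nb(model, doc):
    """Return the class with the highest log posterior; unseen tokens are skipped."""
    scores = {
        label: p["log_prior"] + sum(p["log_lik"][t] for t in doc if t in p["log_lik"])
        for label, p in model.items()
    }
    return max(scores, key=scores.get)


# Toy training data (made up for illustration).
docs = [["late", "flight"], ["lost", "bag"], ["great", "crew"], ["great", "flight"]]
labels = ["negative", "negative", "positive", "positive"]
model = train_nb(docs, labels)
print(predict_nb(model, ["late", "bag"]))
print(predict_nb(model, ["great", "crew"]))
```

The multinomial model expects non-negative feature values, which token counts and the count-based features above satisfy.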

In [None]:
nb = NaiveBayes(featuresCol='features', labelCol='labels')
model = nb.fit(train_data)

In [None]:
# get the predictions
pred_results = model.transform(test_data)
pred_results.select('prediction').show(10)

#### Evaluate the model and find the accuracy

Compare the labels and predictions and find how many are correct.

To find the accuracy, count the correct predictions on the test data and divide by the total number of test records.

**Hint:** convert the predictions dataframe to pandas and compare with labels

In [None]:
# converting to pandas df
preds = pred_results.select('labels','prediction').toPandas()
preds

In [None]:
# comparing labels and predictions of test data
(preds['labels'] == preds['prediction']).sum() / len(preds)
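
Accuracy alone hides which classes get confused with each other. The sketch below computes accuracy and a simple confusion count in pure Python; the label and prediction lists are made-up placeholders, not results from the model.

```python
from collections import Counter

# Placeholder values for illustration only (0=positive, 1=negative, 2=neutral).
labels      = [0, 1, 1, 2, 1, 0]
predictions = [0, 1, 2, 2, 1, 1]

# Accuracy: share of positions where label and prediction agree.
correct = sum(1 for y, p in zip(labels, predictions) if y == p)
accuracy = correct / len(labels)

# Confusion counts keyed by (true label, predicted label).
confusion = Counter(zip(labels, predictions))

print(f"accuracy = {accuracy:.3f}")
for (true, pred), n in sorted(confusion.items()):
    print(f"true={true} predicted={pred}: {n}")
```

On an imbalanced dataset, looking at the off-diagonal pairs shows whether the model mostly predicts the majority class.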

### Deployment (1 point)

Let's integrate all the above code snippets into app.py and run it with **Streamlit**.

Starting from the data loading step, place all the code in app.py, including data preprocessing, feature extraction and model training.

* Implement the `predict_users_Input()` function, which takes one tweet from the user and returns the prediction of the trained model.

* Apply the same preprocessing and feature extraction to the user input as were applied to the training data.

* Capture the user input from a text box in the **Streamlit** app. When the predict button is clicked, the input is classified using the `predict_users_Input()` function.

For more information about Streamlit, click [here](https://docs.streamlit.io/en/stable/)

In [None]:
!pip install -qq streamlit
!pip install -qq colab-everything

In [None]:
%%writefile app.py
import streamlit as st
from pyspark.sql import SparkSession
from pyspark.sql.functions import *
import re
import string
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.feature import CountVectorizer
from pyspark.ml.classification import NaiveBayes
from pyspark.sql.types import ArrayType, StringType
from handyspark import *

import nltk
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

st.write("Creating a spark session :heavy_check_mark:")
spark = SparkSession.builder.appName('TwitterSentiment').getOrCreate()
dataset = spark.read.csv("/content/US_Airline_Tweets.csv",inferSchema=True,header=True)
dataset_filtered = dataset.filter(dataset.text.isNotNull())
fillNull = udf(lambda x: 0 if x == None else x)
dataset_filtered = dataset_filtered.withColumn('negativereason_confidence',fillNull(dataset_filtered['negativereason_confidence']).astype('int'))
dataset_filtered = dataset_filtered.withColumn('airline_sentiment_confidence',fillNull(dataset_filtered['airline_sentiment_confidence']).astype('int'))

hdf = dataset.toHandy()

def words_process(text):
    text = text.lower() # lowercase
    text = re.sub(r'https?://[^\s<>"]+|www\.[^\s<>"]+', '', text) # remove hyperlinks
    text = re.sub(r'\d+', '', text) # remove numbers
    text = " ".join([i for i in text.split() if i.find("@")== -1]) # remove usernames
    text = text.replace('#','') # drop the '#' symbol; the hashtag word is kept
    return text

words = udf(words_process)
dataset_filtered = dataset_filtered.withColumn('text_processed',words(dataset_filtered['text']))

word_udf = udf(lambda x: word_tokenize(x), ArrayType(StringType()))
dataset_filtered = dataset_filtered.withColumn("wordss", word_udf("text_processed"))
stop_words = set(stopwords.words('english'))
punct_udf1 = udf(lambda x: [w for w in x if w not in stop_words], ArrayType(StringType()))
dataset_filtered = dataset_filtered.withColumn("wordss", punct_udf1("wordss"))

array_udf = udf(lambda x: x, ArrayType(StringType()))
dataset_filtered = dataset_filtered.withColumn("wordss", array_udf("wordss"))

def lemmatize(text_arr):
    lemmatizer = WordNetLemmatizer()
    return [lemmatizer.lemmatize(word) for word in text_arr]
lem = udf(lemmatize, ArrayType(StringType()))
dataset_filtered = dataset_filtered.withColumn("wordss", lem("wordss"))
st.write("Preprocessing Done! :heavy_check_mark:")

array_udf = udf(lambda x: x, ArrayType(StringType()))
dataset_filtered = dataset_filtered.withColumn("wordss", array_udf("wordss"))

length = udf(lambda x: len(x))
dataset_filtered = dataset_filtered.withColumn('tweet_length', length(dataset_filtered['text']).astype('int'))
num_hashtags = udf(lambda x : len(re.compile(r"#(\w+)").findall(x)))

dataset_filtered = dataset_filtered.withColumn('num_hashtags', num_hashtags(dataset_filtered['text']).astype('int'))
num_mentions = udf(lambda x : len(re.compile(r"@(\w+)").findall(x)))

dataset_filtered = dataset_filtered.withColumn('num_mentions', num_mentions(dataset_filtered['text']).astype('int'))
def punctCount1(text):
    sum = 0
    for i in text:
        if i in "!$%&'()*+,-./:;<=>?[\\]^_`{|}~":
            sum +=1
    return sum

punctCount = udf(punctCount1)

dataset_filtered = dataset_filtered.withColumn('PunctCount',punctCount(dataset_filtered['text']).astype('int'))
def types_punctuation(text):
    return len(set([i for i in text if i in "!$%&'()*+,-./:;<=>?[\\]^_`{|}~"]))

typePunct = udf(types_punctuation)
dataset_filtered = dataset_filtered.withColumn('typePunct',typePunct(dataset_filtered['text']).astype('int'))

st.write("Ongoing feature extraction!! :heavy_check_mark:")
count = CountVectorizer(inputCol="wordss", outputCol="rawFeatures")
count_model = count.fit(dataset_filtered)
featurizedData = count_model.transform(dataset_filtered)

def LabelEncoder(x):
    if x == 'positive':
        return 0
    elif x == 'negative':
        return 1
    return 2
encoder = udf(LabelEncoder)
featurizedData = featurizedData.withColumn('label', encoder(featurizedData['airline_sentiment']).astype('int'))

featureassembler = VectorAssembler(inputCols=['rawFeatures','airline_sentiment_confidence','negativereason_confidence',
                                              'tweet_length','num_hashtags','num_mentions','retweet_count','PunctCount','typePunct']
                                   ,outputCol='features')
features = featureassembler.transform(featurizedData)

final = features.withColumn('labels',featurizedData['label'])
train_data,test_data = final.randomSplit([0.75,0.25])
nb = NaiveBayes(featuresCol='features', labelCol='labels')
st.write("Training the model :heavy_check_mark:")

model = nb.fit(train_data)

def predict_users_Input(user_input):
  df1 = spark.createDataFrame([ (1, user_input)],['Id', 'UserTweet'])

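  # keep the raw tweet in 'UserTweet' so the hashtag, mention and punctuation counts below see the original text
  df1 = df1.withColumn('text_processed',words(df1['UserTweet']))
  df1 = df1.withColumn("wordss", word_udf("text_processed"))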
  df1 = df1.withColumn("wordss", punct_udf1("wordss"))
  df1 = df1.withColumn("wordss", array_udf("wordss"))
  df1 = df1.withColumn("wordss", lem("wordss"))
  df1 = df1.withColumn("wordss", array_udf("wordss"))

  df1 = df1.withColumn('tweet_length', length(df1['UserTweet']).astype('int'))
  df1 = df1.withColumn('num_hashtags', num_hashtags(df1['UserTweet']).astype('int'))
  df1 = df1.withColumn('num_mentions', num_mentions(df1['UserTweet']).astype('int'))
  df1 = df1.withColumn('PunctCount',punctCount(df1['UserTweet']).astype('int'))
  df1 = df1.withColumn('typePunct',typePunct(df1['UserTweet']).astype('int'))
  make  = udf(lambda x : 0)
  df1 = df1.withColumn('negativereason_confidence',make(df1['UserTweet']).astype('int'))
  df1 = df1.withColumn('airline_sentiment_confidence',make(df1['UserTweet']).astype('int'))
  df1 = df1.withColumn('retweet_count',make(df1['UserTweet']).astype('int'))
  df1_featured = count_model.transform(df1)
  test_features = featureassembler.transform(df1_featured)
  test_predict = model.transform(test_features)
  df_res = test_predict.select('prediction').toPandas()
  return df_res

def decode(label):
  if label == 0:
    return "Positive tweet!"
  elif label == 1:
    return "Negative Tweet!"
  return "Neutral tweet"

user_input = st.text_input("Enter the text input","Your tweet here ")
if st.button('predict'):
    result = predict_users_Input(user_input)
    st.write(decode(result.prediction.values[0]))

After you execute the cell below, you will get a web app link where you can perform the sentiment prediction task.

The cell below keeps executing until the server is stopped by interrupting the execution. An error message may appear upon interruption; you can ignore it.

In [None]:
from colab_everything import ColabStreamlit
ColabStreamlit('/content/app.py')


Refer to the screenshot below.
![img](https://cdn.iisc.talentsprint.com/CDS/MiniProjects/sentiment_analysis_streamlit_button.JPG)