# Sentiment Analysis on Historical Data

Data Source: https://www.kaggle.com/datasets/bwandowando/ukraine-russian-crisis-twitter-dataset-1-2-m-rows

This data was extracted from Twitter by a Kaggle user who has been collecting live tweets about the ongoing Russia-Ukraine crisis. We took only a few of the .csv files, containing a total of 3M rows, for our sentiment analysis.

The data is stored in MongoDB rather than in local storage. We will perform data wrangling and NLP preprocessing, and finally apply our logistic regression model to predict the sentiment of the users.


# Our final logistic regression model, which will be applied to our historical dataset

In [56]:
final_logistic_model

PipelineModel_16ec05e33491

# Importing our data from MongoDB

In [57]:
%%time

mongo_ip = "mongodb://localhost:27017/streaming."

df_war = spark.read.format("com.mongodb.spark.sql.DefaultSource").option("uri", mongo_ip + "Ukraine_Russia_Crisis").load()

df_war.createOrReplaceTempView("df_war_cleaned")

df_war = spark.sql("select text from df_war_cleaned;")


CPU times: total: 15.6 ms
Wall time: 5.81 s
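The trailing dot in `mongo_ip` is deliberate: the connector URI takes the form `mongodb://host:port/database.collection`, so the collection name is simply appended to the base string. A minimal sketch of that concatenation, where `collection_uri` is a hypothetical helper that is not part of the notebook:

```python
def collection_uri(base_uri: str, collection: str) -> str:
    """Hypothetical helper: append a collection name to a base URI
    that already ends in '<database>.' (as mongo_ip does)."""
    return base_uri + collection

mongo_ip = "mongodb://localhost:27017/streaming."

# URI that the read cell above passes to the "uri" option
source_uri = collection_uri(mongo_ip, "Ukraine_Russia_Crisis")
print(source_uri)
```

The same base string is reused further down when the predictions are written back, with only the collection name changing.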


In [52]:
df_war.printSchema() # checking the schema of the imported dataset

root
 |-- text: string (nullable = true)



# Creating a function and a UDF for data wrangling: removing unwanted characters that would otherwise hurt the performance of our logistic regression model.

In [47]:
%%time

import html

from pyspark.sql import functions as f  # `f` is used by the UDF decorator and by clean_data below

@f.udf
def html_unescape(s: str):
    if isinstance(s, str):
        return html.unescape(s)
    return s

user_regex = r"(@\w{1,15})"
hashtag_regex = r"(#\w{1,})"
url_regex = r"((https?|ftp|file):\/{2,3})+([-\w+&@#/%=~|$?!:,.]*)|(www.)+([-\w+&@#/%=~|$?!:,.]*)"
email_regex = r"[\w.-]+@[\w.-]+\.[a-zA-Z]{1,}"

def clean_data(df):
    df = (
        df
        .withColumn("text", f.regexp_replace(f.col("text"), url_regex, "")) # replacing urls with empty string
        .withColumn("text", f.regexp_replace(f.col("text"), email_regex, "")) # replacing email with empty strings
        .withColumn("text", f.regexp_replace(f.col("text"), user_regex, "")) # replacing @<user_name> with empty string
        .withColumn("text", f.regexp_replace(f.col("text"), "#", " ")) # replacing '#' with space
        .withColumn("text", html_unescape(f.col("text")))  # decoding HTML entities (e.g. &amp; becomes &) using the UDF
        .withColumn("text", f.regexp_replace(f.col("text"), "[^a-zA-Z']", " ")) # replacing everything except letters and apostrophes (digits, punctuation, emoji) with a space
        .withColumn("text", f.regexp_replace(f.col("text"), r'\s{1,}', ' ')) # collapsing each run of whitespace into a single space
        .withColumn("text", f.trim(f.col("text"))) # removing leading and trailing whitespaces
        .filter("text != ''") # removing empty strings
    )
    return df


CPU times: total: 0 ns
Wall time: 0 ns
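To sanity-check the cleaning steps without a Spark session, the same sequence can be mirrored on a single string with Python's standard library. This is only a sketch: `clean_text` and the sample tweet are hypothetical, and Python's `re` treats `\w` as Unicode-aware while Spark uses Java regexes, so results may differ for non-ASCII handles.

```python
import html
import re

# Same patterns as in the Spark cell, compiled for Python's `re` module
USER_RE = re.compile(r"(@\w{1,15})")
URL_RE = re.compile(
    r"((https?|ftp|file):\/{2,3})+([-\w+&@#/%=~|$?!:,.]*)|(www.)+([-\w+&@#/%=~|$?!:,.]*)"
)
EMAIL_RE = re.compile(r"[\w.-]+@[\w.-]+\.[a-zA-Z]{1,}")

def clean_text(text: str) -> str:
    """Hypothetical single-string mirror of clean_data (same step order)."""
    text = URL_RE.sub("", text)              # drop URLs
    text = EMAIL_RE.sub("", text)            # drop e-mail addresses
    text = USER_RE.sub("", text)             # drop @mentions
    text = text.replace("#", " ")            # keep hashtag words, drop '#'
    text = html.unescape(text)               # decode HTML entities
    text = re.sub(r"[^a-zA-Z']", " ", text)  # keep only letters and apostrophes
    text = re.sub(r"\s{1,}", " ", text)      # collapse whitespace runs
    return text.strip()

# Made-up tweet used purely for illustration
sample = "@nato Stop the war &amp; help now!! https://t.co/abc123 #Ukraine 2022"
cleaned = clean_text(sample)
print(cleaned)  # Stop the war help now Ukraine
```

A string that contains only digits or punctuation comes back empty, which is the case the final `.filter("text != ''")` in `clean_data` removes.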


# Applying the cleaning logic to our dataset

In [48]:
%%time

df_war = clean_data(df_war)

CPU times: total: 15.6 ms
Wall time: 72.4 ms


# Apply the logistic regression model to the new data

In [49]:
%%time

from pyspark.ml.classification import LogisticRegressionModel

predictions = final_logistic_model.transform(df_war)


CPU times: total: 15.6 ms
Wall time: 92.7 ms


# Selecting desired columns for our predictions dataset.

In [54]:
%%time
predictions.createOrReplaceTempView("prediction_sql")

ukraine_war_df = spark.sql("select prediction, text from prediction_sql;")


CPU times: total: 0 ns
Wall time: 31 ms


# Identifying which sentiments users express, based on the ML model's predictions on our dataset.

In [53]:
ukraine_war_df.groupBy("prediction").count().show()

+----------+------+
|prediction| count|
+----------+------+
|       0.0|910194|
|       4.0|533729|
+----------+------+
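The raw counts are easier to read as shares. The sketch below recomputes them from the table above, assuming the model kept the Sentiment140 label convention (0 = negative, 4 = positive); the `LABELS` mapping is that assumption, not something the notebook prints.

```python
# Assumed label meaning (Sentiment140 convention): 0 = negative, 4 = positive
LABELS = {0.0: "negative", 4.0: "positive"}

# Counts copied from the groupBy output above
counts = {0.0: 910194, 4.0: 533729}

total = sum(counts.values())
shares = {
    LABELS[label]: round(100 * n / total, 1)  # percentage with one decimal
    for label, n in counts.items()
}

print(total)   # 1443923
print(shares)  # {'negative': 63.0, 'positive': 37.0}
```

Note that the total is the number of rows that received a prediction, which is what the percentages are relative to.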



# Business Conclusion:

Of the 1,443,923 tweets that received a prediction, about 63% (910,194) were classified with label 0 and about 37% (533,729) with label 4. Under the Sentiment140 convention (0 = negative, 4 = positive), this suggests that sentiment on the ongoing Russia-Ukraine crisis is predominantly negative. The prediction comes from our logistic regression model, which has an accuracy of 79%. Moreover, the model's accuracy depends heavily on the Sentiment140 dataset, which we used to train it with supervised machine learning.

# Storing the sentiment predictions in MongoDB

In [55]:
%%time

ukraine_war_df.write.format("com.mongodb.spark.sql.DefaultSource") \
  .option("uri", mongo_ip + "Ukraine_Crisis_Sentement") \
  .mode("append") \
  .save()


CPU times: total: 93.8 ms
Wall time: 7min 28s


In [58]:
%%time

ukraine_war_df.write.format("com.mongodb.spark.sql.DefaultSource") \
  .option("uri", mongo_ip + "Ukraine_Crisis_Sentement_demo") \
  .mode("append") \
  .save()


CPU times: total: 0 ns
Wall time: 6min 15s
