# Data_Acquiring_and_Data_Exploration:

                                            
# Overview

Data Source: http://help.sentiment140.com/for-students

We use the Sentiment140 datasets, for academic purposes, to train supervised machine learning models that perform sentiment analysis on the "Ukraine Crisis" dataset.

We obtained the test and train datasets from the Sentiment140 website and loaded them into MongoDB using MongoDB Compass.

The data is a CSV with emoticons removed. Also, there are no null values in this dataset.

The data file has 6 fields (columns):

- 0 - the polarity of the tweet (0 = negative, 2 = neutral, 4 = positive)
- 1 - the id of the tweet (2087)
- 2 - the date of the tweet (Sat May 16 23:58:44 UTC 2009)
- 3 - the query (lyx). If there is no query, then this value is NO_QUERY.
- 4 - the user that tweeted (robotickilldozr)
- 5 - the text of the tweet (Lyx is cool)
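
The field layout above can be sketched in plain Python, outside Spark, to make the intended types concrete. The `Tweet` class and `parse_row` function are hypothetical names introduced for illustration. The sample line is assembled from the example values in the list, with an assumed polarity of 4, and the date pattern only covers dates written with a literal UTC zone, as in the example.

```python
import csv
import io
from dataclasses import dataclass
from datetime import datetime


# Hypothetical record type mirroring the six Sentiment140 fields.
@dataclass
class Tweet:
    polarity: int   # 0 = negative, 2 = neutral, 4 = positive
    id: int
    date: datetime
    query: str      # "NO_QUERY" when there is no query
    user: str
    text: str


def parse_row(row):
    """Convert one CSV row (a list of six strings) into a Tweet."""
    polarity, tweet_id, date, query, user, text = row
    return Tweet(
        polarity=int(polarity),
        id=int(tweet_id),
        # Matches the example date "Sat May 16 23:58:44 UTC 2009";
        # other zone abbreviations would need a different pattern.
        date=datetime.strptime(date, "%a %b %d %H:%M:%S UTC %Y"),
        query=query,
        user=user,
        text=text,
    )


# Sample line built from the example values above; the polarity 4 is assumed.
sample = '"4","2087","Sat May 16 23:58:44 UTC 2009","lyx","robotickilldozr","Lyx is cool"\n'
tweet = parse_row(next(csv.reader(io.StringIO(sample))))
print(tweet.polarity, tweet.date.isoformat(), tweet.text)
```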

We use the Jupyter cell magic (%%time) to measure the execution time of each code cell. It reports the CPU time and the wall time of the cell.
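
As a rough picture of what the two reported numbers mean, the sketch below measures CPU time and wall-clock time for a block of code with the standard library. It is not how IPython implements %%time, and the `timed` context manager is a hypothetical name.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed():
    """Hypothetical helper: record CPU and wall time of the enclosed block."""
    result = {}
    wall_start = time.perf_counter()   # wall-clock timer
    cpu_start = time.process_time()    # CPU time used by this Python process
    try:
        yield result
    finally:
        result["cpu"] = time.process_time() - cpu_start
        result["wall"] = time.perf_counter() - wall_start


with timed() as t:
    time.sleep(0.05)   # waiting consumes wall time but almost no CPU time

print(f"CPU time: {t['cpu'] * 1000:.1f} ms, wall time: {t['wall'] * 1000:.1f} ms")
```

CPU time counts only work done by the Python process itself. That is a likely reason why cells that mostly wait on Spark show a CPU time near zero next to a wall time of several seconds.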

# Initializing Spark from a JupyterLab environment:


In [1]:
%%time

import findspark
findspark.init()

CPU times: total: 0 ns
Wall time: 7.66 ms



# Importing the necessary libraries:

In [2]:
%%time

import pyspark
from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.sql import SparkSession
from pyspark import StorageLevel
from pyspark.sql import functions as f
from pyspark.sql.functions import col

CPU times: total: 46.9 ms
Wall time: 196 ms


# Creating a Spark session:

In [3]:
%%time

spark = SparkSession \
        .builder \
        .appName("Data Acquring and exploration") \
        .config("spark.jars.packages", "org.mongodb.spark:mongo-spark-connector_2.12:3.0.1") \
        .config("spark.driver.memory","40g") \
        .config("spark.executor.memory","50g") \
        .master("local") \
        .getOrCreate()


CPU times: total: 0 ns
Wall time: 14.3 s


# Defining the MongoDB connection URI prefix (host, port and database):

In [4]:
%%time

mongo_ip = "mongodb://localhost:27017/streaming."

CPU times: total: 0 ns
Wall time: 0 ns
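
The trailing dot in mongo_ip lets later cells append a collection name, giving a URI of the form host/database.collection. A minimal sketch of that composition, where `collection_uri` is a hypothetical helper that is not part of the notebook:

```python
mongo_ip = "mongodb://localhost:27017/streaming."


def collection_uri(prefix, collection):
    """Hypothetical helper: append a collection name to a 'host/database.' prefix."""
    if not prefix.endswith("."):
        raise ValueError("prefix must end with '<database>.'")
    return prefix + collection


print(collection_uri(mongo_ip, "test_nlp_data"))
# mongodb://localhost:27017/streaming.test_nlp_data
```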


# Reading the test_nlp_data collection from the streaming database in MongoDB and counting the rows in the resulting DataFrame:

In [5]:
%%time

testdf = spark.read.format("com.mongodb.spark.sql.DefaultSource").option("uri", mongo_ip + "test_nlp_data").load()

testdf.count()

CPU times: total: 0 ns
Wall time: 3.82 s


498

# Applying a proper schema to the test dataset:

In [6]:
%%time

# apply the new schema to the testdf DataFrame

testdf = testdf \
    .withColumn("polarity", col("_c0").cast("float")) \
    .withColumn("id", col("_c1").cast("long")) \
    .withColumn("date_time", col("_c2").cast("string")) \
    .withColumn("query", col("_c3").cast("string")) \
    .withColumn("user", col("_c4").cast("string")) \
    .withColumn("text", col("_c5").cast("string")) \
    .drop("_c0", "_c1", "_c2", "_c3", "_c4", "_c5")

CPU times: total: 0 ns
Wall time: 164 ms


# Grouping the test dataset by the polarity column to see the number of records for each polarity value:

In [7]:
%%time

testdf.groupBy("polarity").count().show()

+--------+-----+
|polarity|count|
+--------+-----+
|     2.0|  139|
|     4.0|  182|
|     0.0|  177|
+--------+-----+

CPU times: total: 15.6 ms
Wall time: 745 ms



# Reading the train_nlp_data collection from the streaming database in MongoDB and counting the rows in the resulting DataFrame:

In [8]:
%%time

traindf = spark.read.format("com.mongodb.spark.sql.DefaultSource").option("uri", mongo_ip + "train_nlp_data").load()

traindf.count()

CPU times: total: 0 ns
Wall time: 12.5 s


1600000


# Applying a proper schema to the train dataset:

In [9]:
%%time

traindf = traindf \
    .withColumn("polarity", col("_c0").cast("float")) \
    .withColumn("id", col("_c1").cast("long")) \
    .withColumn("date_time", col("_c2").cast("string")) \
    .withColumn("query", col("_c3").cast("string")) \
    .withColumn("user", col("_c4").cast("string")) \
    .withColumn("text", col("_c5").cast("string")) \
    .drop("_c0", "_c1", "_c2", "_c3", "_c4", "_c5")

CPU times: total: 0 ns
Wall time: 50.4 ms
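
The same six casts are applied to both testdf and traindf. One way to avoid repeating them is a small helper, sketched below. `SENTIMENT140_COLUMNS` and `apply_sentiment140_schema` are hypothetical names, and the function assumes it receives a PySpark DataFrame that has the columns _c0 to _c5. It relies only on DataFrame.withColumn, DataFrame.drop and Column.cast, so it needs no extra imports.

```python
# (source column, new column name, Spark SQL type) for the six Sentiment140 fields.
SENTIMENT140_COLUMNS = [
    ("_c0", "polarity", "float"),
    ("_c1", "id", "long"),
    ("_c2", "date_time", "string"),
    ("_c3", "query", "string"),
    ("_c4", "user", "string"),
    ("_c5", "text", "string"),
]


def apply_sentiment140_schema(df):
    """Hypothetical helper: add the named, typed columns and drop the raw _cN ones."""
    for source, name, dtype in SENTIMENT140_COLUMNS:
        # df[source] selects the raw column; cast() converts it to the target type.
        df = df.withColumn(name, df[source].cast(dtype))
    return df.drop(*[source for source, _, _ in SENTIMENT140_COLUMNS])
```

With this in place, each schema cell would reduce to a single call such as testdf = apply_sentiment140_schema(testdf).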


# Creating a temp view on the train dataset in order to run Spark SQL queries:

In [10]:
%%time

traindf.createOrReplaceTempView("traindf")

traindf_filtered = spark.sql("SELECT polarity,text,date_time FROM traindf;")

CPU times: total: 0 ns
Wall time: 60.4 ms


# Grouping the train dataset by the polarity column to see the number of records for each polarity value:

In [12]:
%%time

traindf_sql = spark.sql("SELECT polarity,count(*) FROM traindf group by polarity;")

traindf_sql.show()

+--------+--------+
|polarity|count(1)|
+--------+--------+
|     0.0|  800000|
|     4.0|  800000|
+--------+--------+

CPU times: total: 0 ns
Wall time: 7.51 s


# Creating a new collection in MongoDB named traindf_filtered to store the rows of the traindf_filtered DataFrame:

In [13]:
%%time

traindf_filtered.repartition(20).write \
  .format("com.mongodb.spark.sql.DefaultSource") \
  .mode("append") \
  .option("uri", mongo_ip + "traindf_filtered") \
  .option("partitioner", "MongoSinglePartitioner") \
  .option("partitionKey", "polarity") \
  .save()

CPU times: total: 0 ns
Wall time: 21.1 s


# Inference:

We cannot use the test dataset because the test and train datasets differ: the test dataset includes neutral (2) tweets, whereas the train dataset has 1.6M rows with only negative (0) and positive (4) polarity. We also expect that the more records we train on, the more accurate our model will be.

Since we are going to use the training dataset from Sentiment140, our model will be trained to detect only 0 (negative sentiment) and 4 (positive sentiment).
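
If a rough check of a two-class model against the three-class test dataset were still wanted, one option is to drop the neutral rows and map the remaining polarities to binary labels. This is an assumption for illustration, not a step taken in this notebook, and `to_binary_label` is a hypothetical name. The counts come from the groupBy output on the test dataset.

```python
# Polarity codes from the Sentiment140 format.
NEGATIVE, NEUTRAL, POSITIVE = 0, 2, 4


def to_binary_label(polarity):
    """Map 0 -> 0 (negative) and 4 -> 1 (positive); return None for neutral (2)."""
    polarity = int(polarity)   # accepts 4 or 4.0, as produced by the float cast
    if polarity == NEGATIVE:
        return 0
    if polarity == POSITIVE:
        return 1
    if polarity == NEUTRAL:
        return None
    raise ValueError(f"unexpected polarity: {polarity}")


# Test-set counts per polarity, taken from the groupBy output on testdf.
test_counts = {2.0: 139, 4.0: 182, 0.0: 177}
usable = sum(n for p, n in test_counts.items() if to_binary_label(p) is not None)
print(usable)  # 359 rows remain after dropping the 139 neutral ones
```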