## <font color='#0000FF'>MSC_DA_CA2 - INTEGRATEDCA2<font color='#1ABC9D'>
### <font color='#'>**Advanced Data Analytics  & Big Data Storage and Processing**
### <font color='#1ABC9C'>**Lecturer(s): David McQuaid and Muhammad Iqbal**
------
<font color='#1ABC9C'>**Student name / ID** // Rosilene Francisca da Silva - 2021090

### Data Integration and Preprocessing: Leveraging Apache Spark for Populating MySQL Databases with Large Datasets.

### Configure Apache Spark
Start Spark Session:
Set up PySpark, including necessary packages for MySQL connections.

The dataset used in this analysis, ProjectTweets.csv, was provided by Professor McQuaid via the Moodle course page for the MSc in Data Analytics at CCT College Dublin (https://moodle.cct.ie/mod/assign/view.php?id=44089) on 18 April 2024.

MySQL was chosen over a NoSQL database because the dataset has a fixed, well-organised schema and MySQL offers dependable transactional integrity, strong query capabilities, and simple integration with Apache Spark through JDBC. Together, these advantages make MySQL the better option for this particular task of populating the data and carrying out further analytical processing.

In [None]:
Don't run

In [3]:
spark.stop()

In [4]:
from pyspark.sql import SparkSession
from pyspark import SparkConf, SparkContext
import os

spark = SparkSession.builder \
    .appName("Database Comparative Analysis") \
    .config("spark.sql.legacy.timeParserPolicy", "LEGACY") \
    .config("spark.driver.memory", "8g") \
    .config("spark.executor.memory", "8g") \
    .config("spark.executor.cores", "4") \
    .config("spark.dynamicAllocation.enabled", "true") \
    .config("spark.dynamicAllocation.minExecutors", "2") \
    .config("spark.dynamicAllocation.maxExecutors", "100") \
    .config("spark.sql.shuffle.partitions", "1000") \
    .config("spark.jars", "/home/hduser/Downloads/mysql-connector-j-8.0.33.jar") \
    .getOrCreate()

24/05/11 19:12:25 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
24/05/11 19:12:25 WARN Utils: Service 'SparkUI' could not bind on port 4041. Attempting port 4042.


In [5]:
spark

In [6]:
import py4j
print(py4j.__version__)

0.10.9.7


Rationale:
This PySpark setup is intended to process the data for the database comparison in a scalable manner. The SparkSession is created with the application name "Database Comparative Analysis", and the legacy time parser policy keeps timestamp parsing compatible with the date patterns used before Spark 3.0. The driver and the executors are each given 8 GB of memory to handle large data volumes, and each executor is assigned 4 cores for parallel processing.

Dynamic allocation lets the number of executors vary with the workload, from a minimum of 2 to a maximum of 100. This generally only takes effect when Spark runs under a cluster manager. The number of shuffle partitions is raised from the default of 200 to 1000, which is meant to keep individual shuffle tasks small on large datasets, although on a small or single-machine setup that many partitions can add scheduling overhead instead of reducing execution time.

Finally, including the MySQL JDBC connector jar lets Spark read from and write to MySQL directly. Together, these settings aim to provide a scalable environment for large-scale database comparison analysis.

#### Before reading the full dataset, test the connection to identify any issues

In [7]:
# Test loading a simple query to ensure connectivity
try:
    jdbc_df = spark.read.format("jdbc").option("url", "jdbc:mysql://localhost/twitter_data").option(
        "driver", "com.mysql.cj.jdbc.Driver").option("dbtable", "tweet_details").option(
        "user", "root").option("password", "password").load()
    jdbc_df.show(1)
    print("Connection established successfully.")
    
except Exception as e:
    print("An error occurred:", e)

[Stage 0:>                                                          (0 + 1) / 1]

+---+----------+-------------------+------------+-------------+--------------------+
| id|   user_id|               date|query_status|  user_handle|          tweet_text|
+---+----------+-------------------+------------+-------------+--------------------+
|  1|1467810672|2009-04-06 22:19:49|    NO_QUERY|scotthamilton|is upset that he ...|
+---+----------+-------------------+------------+-------------+--------------------+
only showing top 1 row

Connection established successfully.


                                                                                

### Load the data from MySQL into a new DataFrame
The database `twitter_data` was created in MySQL, together with the `tweet_details` table.  
The columns were defined and renamed as "id", "user_id", "date", "query_status", "user_handle", and "tweet_text".

In [8]:
# Load the data from MySQL into a new DataFrame
mysql_df = spark.read.format("jdbc").options(
    url="jdbc:mysql://localhost/twitter_data",
    driver="com.mysql.cj.jdbc.Driver",
    dbtable="tweet_details",
    user="root",
    password="password"
).load()

# Show the first few rows of the DataFrame to verify the data
mysql_df.show(2)

[Stage 1:>                                                          (0 + 1) / 1]

+---+----------+-------------------+------------+-------------+--------------------+
| id|   user_id|               date|query_status|  user_handle|          tweet_text|
+---+----------+-------------------+------------+-------------+--------------------+
|  1|1467810672|2009-04-06 22:19:49|    NO_QUERY|scotthamilton|is upset that he ...|
|  2|1467810917|2009-04-06 22:19:53|    NO_QUERY|     mattycus|@Kenichan I dived...|
+---+----------+-------------------+------------+-------------+--------------------+
only showing top 2 rows



                                                                                

This code loaded the data from the `tweet_details` table in MySQL into a new DataFrame called `mysql_df`. The correctly displayed rows indicate that the data was successfully read from MySQL into PySpark.
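The connection test and the load cell both repeat the same JDBC options and hard-code the password. A minimal sketch of one way to centralise them is shown here; the helper name `mysql_jdbc_options` and the environment variable names `MYSQL_USER` and `MYSQL_PASSWORD` are assumptions made for illustration and are not part of the original notebook.

```python
import os


def mysql_jdbc_options(table, database="twitter_data", host="localhost"):
    """Build the option dictionary for a Spark JDBC read (hypothetical helper)."""
    return {
        "url": f"jdbc:mysql://{host}/{database}",
        "driver": "com.mysql.cj.jdbc.Driver",
        "dbtable": table,
        # MYSQL_USER and MYSQL_PASSWORD are assumed variable names, not from the notebook
        "user": os.environ.get("MYSQL_USER", "root"),
        "password": os.environ.get("MYSQL_PASSWORD", ""),
    }


opts = mysql_jdbc_options("tweet_details")

# Print everything except the password to confirm what would be passed to Spark
print({key: value for key, value in opts.items() if key != "password"})
```

With an active SparkSession, the dictionary could then be passed as `spark.read.format("jdbc").options(**opts).load()`, which keeps the credentials out of the notebook itself.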

#### Verify the DataFrame
After loading the dataset, it is a good idea to check the schema and the first few rows of the DataFrame to ensure that everything loaded correctly.

In [9]:
# Print DataFrame schema
mysql_df.printSchema()

mysql_df.show(5, truncate=True)

root
 |-- id: integer (nullable = true)
 |-- user_id: long (nullable = true)
 |-- date: timestamp (nullable = true)
 |-- query_status: string (nullable = true)
 |-- user_handle: string (nullable = true)
 |-- tweet_text: string (nullable = true)



[Stage 2:>                                                          (0 + 1) / 1]

+---+----------+-------------------+------------+-------------+--------------------+
| id|   user_id|               date|query_status|  user_handle|          tweet_text|
+---+----------+-------------------+------------+-------------+--------------------+
|  1|1467810672|2009-04-06 22:19:49|    NO_QUERY|scotthamilton|is upset that he ...|
|  2|1467810917|2009-04-06 22:19:53|    NO_QUERY|     mattycus|@Kenichan I dived...|
|  3|1467811184|2009-04-06 22:19:57|    NO_QUERY|      ElleCTF|my whole body fee...|
|  4|1467811193|2009-04-06 22:19:57|    NO_QUERY|       Karoli|@nationwideclass ...|
|  5|1467811372|2009-04-06 22:20:00|    NO_QUERY|     joy_wolf|@Kwesidei not the...|
+---+----------+-------------------+------------+-------------+--------------------+
only showing top 5 rows



                                                                                

#### Dataframe Information
Show the number of rows and columns.

In [10]:
num_rows = mysql_df.count()
num_columns = len(mysql_df.columns)
print(f"Number of rows: {num_rows}, Number of columns: {num_columns}")

[Stage 3:>                                                          (0 + 1) / 1]

Number of rows: 1599999, Number of columns: 6


                                                                                

With 1,599,999 rows and 6 columns, the dataset offers a sizable amount of data to examine. Each row holds the information for one tweet, and the six columns cover numerical identifiers, a timestamp, a categorical status field, and free text. This structure supports exploratory data analysis (EDA) of user behaviour and data patterns, as well as more complex analysis or machine learning tasks.

### Data Pre-Processing in PySpark

#### Checking for missing Values

In [11]:
# Initialize a flag to track whether missing values are found
from pyspark.sql.functions import col
missing_values_found = False

for column in mysql_df.columns:
    missing_count = mysql_df.filter(col(column).isNull() | (col(column) == '')).count()
    if missing_count > 0:
        print(f"Column {column} has {missing_count} missing values")
        missing_values_found = True

# Check the flag after checking all columns, and print a message if no missing values were found
if not missing_values_found:
    print("No missing values found in any column.")

[Stage 21:>                                                         (0 + 1) / 1]

No missing values found in any column.


                                                                                

Based on the output, the dataset has no missing values.

In [12]:
# Displaying dtypes of columns
mysql_df.dtypes

[('id', 'int'),
 ('user_id', 'bigint'),
 ('date', 'timestamp'),
 ('query_status', 'string'),
 ('user_handle', 'string'),
 ('tweet_text', 'string')]

According to the DataFrame schema, the columns `id` and `user_id` are numerical (`int` and `bigint`). 
The `date` column has type `timestamp`, meaning it is a datetime field, while `query_status`, `user_handle`, and `tweet_text` have type `string`, meaning they are textual fields (comparable to varchar in SQL).

#### Displaying some columns

In [13]:
mysql_df.select('user_id').show(n=5, truncate=False)

[Stage 24:>                                                         (0 + 1) / 1]

+----------+
|user_id   |
+----------+
|1467810672|
|1467810917|
|1467811184|
|1467811193|
|1467811372|
+----------+
only showing top 5 rows



                                                                                

In [14]:
mysql_df.select('date').show(n=5, truncate=False)

                                                                                

+-------------------+
|date               |
+-------------------+
|2009-04-06 22:19:49|
|2009-04-06 22:19:53|
|2009-04-06 22:19:57|
|2009-04-06 22:19:57|
|2009-04-06 22:20:00|
+-------------------+
only showing top 5 rows



The output illustrates the temporal character of the data by showing the first five rows of the date column from the dataset. Each entry in the date column is a timestamp indicating when a particular event, such as a tweet, occurred.

#### Using the `.agg()` method with the `min` and `max` aggregation functions to find and print the first and last tweet dates

In [15]:
# Import Aggregation Functions
from pyspark.sql import functions as F

# Finding the first and last tweet dates
first_last_dates = mysql_df.agg(
    F.min("date").alias("first_date"),
    F.max("date").alias("last_date")
).collect()[0]

first_date = first_last_dates["first_date"]
last_date = first_last_dates["last_date"]

print(f"First tweet date: {first_date}")
print(f"Last tweet date: {last_date}")

[Stage 26:>                                                         (0 + 1) / 1]

First tweet date: 2009-04-06 22:19:49
Last tweet date: 2009-06-25 10:28:31


                                                                                

The earliest and latest timestamps show that the dataset covers the period from April 6, 2009, to June 25, 2009. The first tweet was sent on April 6, 2009, at 22:19:49, and the last on June 25, 2009, at 10:28:31. 

The dataset therefore spans roughly 80 days, a little under three months, which gives it a temporal range suitable for studying tweet patterns and trends over that period. The result clarifies the historical bounds of the dataset and guides the later analysis steps.
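As a quick check of the length of that period, the two printed timestamps can be subtracted with the standard library. This is only a sketch that re-enters the values from the output above by hand and does not query Spark.

```python
from datetime import datetime

# Values copied from the printed first and last tweet dates
first_date = datetime(2009, 4, 6, 22, 19, 49)
last_date = datetime(2009, 6, 25, 10, 28, 31)

# Subtracting two datetimes gives a timedelta
span = last_date - first_date

print(span)                                    # 79 days, 12:08:42
print(round(span.total_seconds() / 86400, 1))  # 79.5 days
```

The span is about 79.5 days, or a little over eleven weeks.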

In [16]:
# Selecting multiple columns
mysql_df.select(['user_handle','tweet_text']).show(n=5, truncate=False)

[Stage 29:>                                                         (0 + 1) / 1]

+-------------+---------------------------------------------------------------------------------------------------------------+
|user_handle  |tweet_text                                                                                                     |
+-------------+---------------------------------------------------------------------------------------------------------------+
|scotthamilton|is upset that he can't update his Facebook by texting it... and might cry as a result  School today also. Blah!|
|mattycus     |@Kenichan I dived many times for the ball. Managed to save 50%  The rest go out of bounds                      |
|ElleCTF      |my whole body feels itchy and like its on fire                                                                 |
|Karoli       |@nationwideclass no, it's not behaving at all. i'm mad. why am i here? because I can't see you all over there. |
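|joy_wolf     |@Kwesidei not the whole crew                                                                                   |
+-------------+---------------------------------------------------------------------------------------------------------------+
only showing top 5 rows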

                                                                                

The dataset's first rows provide insight into the `user_handle` and `tweet_text` columns. The `user_handle` column contains the Twitter handle of the author of each tweet, while the `tweet_text` column holds the text content of the tweet. For example, user scotthamilton complains about not being able to update Facebook via text, whereas mattycus describes diving for a ball. Other tweets show physical discomfort (ElleCTF), frustration (Karoli), and interactions with other users (joy_wolf). This selection illustrates the diversity of content in the dataset, which ranges from personal updates to social interactions. The `truncate=False` argument ensures that the complete content of each tweet is displayed without truncation.

#### Data Distribution
Count distinct values of a categorical column.

In [19]:
# Selecting 'query_status' column
mysql_df.groupBy("query_status").count().show(10)

[Stage 36:>                                                         (0 + 1) / 1]

+------------+-------+
|query_status|  count|
+------------+-------+
|    NO_QUERY|1599999|
+------------+-------+



                                                                                

All 1,599,999 tweets have the value NO_QUERY in the `query_status` column. Because the column is constant across all rows, either the field was not meaningfully used in this dataset or none of the tweets were connected to any particular search query. A constant column carries no information for analysis, so it can be dropped from further study and attention focused on the other columns.

In [21]:
# Selecting 'user_handle' column
mysql_df.groupBy("user_handle").count().show(10)

[Stage 42:>                                                         (0 + 1) / 1]

+---------------+-----+
|    user_handle|count|
+---------------+-----+
|     megan_rice|   15|
|         MeghTW|    1|
|stranger_danger|   14|
|       kyrabeth|    1|
|    lovelylivxo|   16|
|      tink68113|    1|
|     Svalentyna|    1|
|     bakerbelle|    1|
|  somethingalex|    1|
|     sexy_ass_T|    1|
+---------------+-----+
only showing top 10 rows



                                                                                

Grouping the dataset by the `user_handle` column and counting the rows in each group gives the number of tweets per user. Each of the first ten rows of the output shows one Twitter user and the number of tweets they sent. For example, lovelylivxo has 16 tweets, megan_rice has 15, and stranger_danger has 14, whereas MeghTW, kyrabeth, and the remaining users shown each have only one. The rows are not sorted, but even this sample shows that some users tweet frequently while others appear only once in the dataset. 

#### Summary Statistics
Generate summary statistics

In [22]:
# Basic statistics
mysql_df.describe().show()

24/05/11 19:16:41 WARN package: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.
[Stage 45:>                                                         (0 + 1) / 1]

+-------+-----------------+--------------------+------------+--------------------+--------------------+
|summary|               id|             user_id|query_status|         user_handle|          tweet_text|
+-------+-----------------+--------------------+------------+--------------------+--------------------+
|  count|          1599999|             1599999|     1599999|             1599999|             1599999|
|   mean|         800000.0|1.9988178841753244E9|        null| 4.325887521835714E9|                null|
| stddev|461880.0710141109| 1.935756789172917E8|        null|5.162733218454889E10|                null|
|    min|                1|          1467810672|    NO_QUERY|        000catnap000|                 ...|
|    max|          1599999|          2329205794|    NO_QUERY|          zzzzeus111|ï¿½ï¿½ï¿½ï¿½ï¿½ß§...|
+-------+-----------------+--------------------+------------+--------------------+--------------------+



                                                                                

The output gives summary statistics for the dataset, shedding light on the distribution and features of each column. The `date` column is absent because `describe()` summarises only numeric and string columns.

With a mean of 800,000.0 and a standard deviation of roughly 461,880, the `id` column is a sequence of unique identifiers ranging from 1 to 1,599,999. The `user_id` column ranges from 1,467,810,672 to 2,329,205,794, with a mean of almost 1.999 billion. All rows in the `query_status` column have the same value, NO_QUERY. The `user_handle` column spans from 000catnap000 to zzzzeus111, indicating a wide range of users; the mean and standard deviation reported for it are not meaningful for a text field. 
Finally, the `tweet_text` column contains varied text data, including special characters that are evident in the maximum value `ï¿½ï¿½ï¿½ï¿½ï¿½ß§...`, which may be indicative of encoding problems. In summary, most columns are diverse and well populated, the `query_status` column is uniform, and the special characters would need to be cleaned up.

#### Correlation Analysis
Find correlations between numerical columns.

In [23]:
# List of numerical columns
numerical_columns = [name for name, dtype in mysql_df.dtypes if dtype in ('int', 'bigint', 'double', 'float')]
print(f"Numerical columns: {numerical_columns}")

Numerical columns: ['id', 'user_id']


In [24]:
# Summary statistics only for numerical columns
mysql_df.select(numerical_columns).describe().show()

[Stage 48:>                                                         (0 + 1) / 1]

+-------+-----------------+--------------------+
|summary|               id|             user_id|
+-------+-----------------+--------------------+
|  count|          1599999|             1599999|
|   mean|         800000.0|1.9988178841753244E9|
| stddev|461880.0710141109| 1.935756789172917E8|
|    min|                1|          1467810672|
|    max|          1599999|          2329205794|
+-------+-----------------+--------------------+



                                                                                

In [25]:
# Calculate correlations between numerical columns
from pyspark.sql.functions import corr

for column1 in numerical_columns:
    for column2 in numerical_columns:
        if column1 != column2:
            correlation = mysql_df.select(corr(column1, column2)).first()[0]
            print(f"Correlation between {column1} and {column2}: {correlation}")

                                                                                

Correlation between id and user_id: 0.22304937463402058


[Stage 54:>                                                         (0 + 1) / 1]

Correlation between user_id and id: 0.22304937463402053


                                                                                

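The correlation between `id` and `user_id` is approximately 0.223. This is a weak positive linear relationship: it corresponds to an r² of about 0.05, so one column accounts for only around 5% of the variance in the other. Because both columns are identifiers, not measurements, the correlation most likely reflects the order in which the rows were stored and says little about user behaviour. The two printed values differ only in the last decimal place because of floating-point rounding; correlation is symmetric, so each pair needs to be computed only once.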
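To make the calculation behind `corr` concrete, the sketch below computes the Pearson coefficient by hand and visits each pair of columns once with `itertools.combinations`, which avoids the duplicated (`id`, `user_id`) and (`user_id`, `id`) results. The five values per column are made up for illustration and are not taken from the dataset.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation: covariance divided by the product of the spreads."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical toy columns, not real rows from tweet_details
columns = {
    "id": [1, 2, 3, 4, 5],
    "user_id": [10, 30, 20, 50, 40],
}

# combinations() yields each unordered pair exactly once
for first, second in combinations(columns, 2):
    r = pearson(columns[first], columns[second])
    print(f"Correlation between {first} and {second}: {r:.3f} (r squared = {r * r:.3f})")
```

Squaring the coefficient gives the share of variance explained, which is why a value of 0.223 corresponds to only about 5%.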

#### Adding a New Column with the current timestamp & Extract only those tweets that contain the keyword "summer" in the tweet_text column.
The `inserted_at` column will contain the timestamp indicating when each row was processed.

In [26]:
# Filtering tweets containing a specific keyword 'summer'
filtered_df = mysql_df.filter(mysql_df.tweet_text.contains("summer"))

# Add a current timestamp column for the insert time
from pyspark.sql.functions import current_timestamp

final_df = filtered_df.withColumn("inserted_at", current_timestamp())

# Select only the desired columns and print it
final_df.select("date", "tweet_text", "inserted_at").show(10, truncate=True)

[Stage 57:>                                                         (0 + 1) / 1]

+-------------------+--------------------+--------------------+
|               date|          tweet_text|         inserted_at|
+-------------------+--------------------+--------------------+
|2009-04-06 22:27:00|@jacobsummers Sor...|2024-05-11 19:17:...|
|2009-04-06 23:40:34|@kimmyawesome Ohh...|2024-05-11 19:17:...|
|2009-04-06 23:48:30|It's official! I'...|2024-05-11 19:17:...|
|2009-04-07 00:12:36|ok my TWEET PEEP ...|2024-05-11 19:17:...|
|2009-04-07 00:13:15|summer camp or su...|2024-05-11 19:17:...|
|2009-04-07 00:34:43|Downy weather  Wh...|2024-05-11 19:17:...|
|2009-04-07 00:35:12|Craaaaap. My Macb...|2024-05-11 19:17:...|
|2009-04-07 00:41:46|@dadi_iyal and yo...|2024-05-11 19:17:...|
|2009-04-07 01:57:26|@meatrack no more...|2024-05-11 19:17:...|
|2009-04-07 01:57:35|searching for a j...|2024-05-11 19:17:...|
+-------------------+--------------------+--------------------+
only showing top 10 rows



                                                                                

After filtering, the result displays the first ten tweets whose text contains the string "summer". Each tweet is shown with its original timestamp (`date`), its text (`tweet_text`), and the new `inserted_at` column, which holds the timestamp at which the row was processed. Note that `contains` performs a case-sensitive substring match, so the first row may match only because of the handle @jacobsummers, while tweets that write "Summer" with a capital letter are excluded. The `date` column provides information on the temporal distribution of tweets mentioning "summer". For example, one user is undecided between "summer camp or summer school," while another tweets about "Downy weather." The `inserted_at` column distinguishes the time of processing from the original tweet timestamps. 

This allows tracking of when the filtered data was curated. Overall, the result shows a range of topics relating to "summer," from plans to technical concerns, giving a first view of how users talk about this topic.
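The difference between a substring match and a whole-word match can be illustrated with Python's `re` module. The sample texts below are hypothetical (only the handle @jacobsummers appears in the output above), and the sketch runs on plain strings, not on the Spark DataFrame.

```python
import re

# Hypothetical sample tweets for illustration
tweets = [
    "@jacobsummers thanks for the follow",
    "Summer camp starts next week",
    "cannot wait for summer!",
    "midsummer party tonight",
]

# Case-sensitive substring test, comparable to the filter used above
substring_hits = [t for t in tweets if "summer" in t]

# Whole-word, case-insensitive test using word boundaries
word_pattern = re.compile(r"\bsummer\b", re.IGNORECASE)
word_hits = [t for t in tweets if word_pattern.search(t)]

print(substring_hits)
print(word_hits)
```

The substring test keeps the handle and "midsummer" but misses "Summer", while the word-boundary pattern keeps only the two tweets that use the word itself. In Spark, a comparable filter could probably be written with `rlike` and a pattern such as `(?i)\bsummer\b`; that variant is an assumption and was not run as part of this notebook.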

#### Word Frequency Count

In [27]:
# Frequency of words in the "tweet_text" column
from pyspark.sql.functions import explode, split

# Split each tweet on runs of whitespace (raw string keeps the regex backslash intact)
words_df = mysql_df.withColumn("word", explode(split(mysql_df["tweet_text"], r"\s+")))
words_df.groupBy("word").count().orderBy("count", ascending=False).show(10)

[Stage 60:>                                                         (0 + 4) / 4]

+----+-------+
|word|  count|
+----+-------+
|    |1184159|
|  to| 552961|
|   I| 496616|
| the| 487501|
|   a| 366211|
|  my| 280025|
| and| 275263|
|   i| 249976|
|  is| 217692|
| you| 213871|
+----+-------+
only showing top 10 rows



                                                                                

This cell analyses the text data by counting word frequencies in the `tweet_text` column. Each tweet is split on whitespace and exploded into one row per word. The DataFrame is then grouped by the `word` column, and `count()` determines how many times each word appears. The results are sorted in descending order of `count`, and the top ten are shown. The most frequent token, with 1,184,159 occurrences, is the empty string. Since `\s+` already treats a run of consecutive spaces as a single separator, these empty tokens most likely come from leading or trailing whitespace in the tweets. The empty token is followed by common English words such as "to," "I," and "the." Note that "I" and "i" are counted separately because the comparison is case-sensitive.

Rationale: This analysis shows the dataset's typical word usage, exposes data cleaning problems (such as the large number of empty tokens), and provides a foundation for additional text analysis tasks like sentiment analysis or topic modelling.
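The origin of the empty tokens can be reproduced on a few strings with the standard library. This is a sketch on hypothetical sample texts, using `re.split` as a stand-in for the Spark `split` call; it also shows how lowercasing merges "I" and "i".

```python
import re
from collections import Counter

# Hypothetical sample tweets; the first has leading and trailing spaces
tweets = [" I love summer ", "i love  the summer", "The sun"]

# Splitting on whitespace runs leaves empty strings at the edges of padded text
raw_tokens = [word for text in tweets for word in re.split(r"\s+", text)]
raw_counts = Counter(raw_tokens)

# str.split() with no argument trims the edges; lower() folds the case
clean_tokens = [word for text in tweets for word in text.lower().split()]
clean_counts = Counter(clean_tokens)

print(raw_counts.most_common(3))
print(clean_counts.most_common(3))
```

In the raw counts the empty string appears twice, both times from the padded first tweet, while the double space in the second tweet produces none. Applying the same idea in Spark would presumably mean trimming and lowercasing `tweet_text` before the split and filtering out empty words afterwards.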

In [28]:
# Print current DataFrame columns
print(mysql_df.columns)

['id', 'user_id', 'date', 'query_status', 'user_handle', 'tweet_text']


In [29]:
# Displaying dtypes of columns
mysql_df.dtypes

[('id', 'int'),
 ('user_id', 'bigint'),
 ('date', 'timestamp'),
 ('query_status', 'string'),
 ('user_handle', 'string'),
 ('tweet_text', 'string')]

### Converting the dataset from MySQL to Pandas

#### Before converting the dataset to pandas, drop some unnecessary columns for further analysis

In [30]:
# Dropping the 'id' and 'query_status' columns
mysql_df = mysql_df.drop('id','query_status')

In [31]:
mysql_df.show(2)

[Stage 61:>                                                         (0 + 1) / 1]

+----------+-------------------+-------------+--------------------+
|   user_id|               date|  user_handle|          tweet_text|
+----------+-------------------+-------------+--------------------+
|1467810672|2009-04-06 22:19:49|scotthamilton|is upset that he ...|
|1467810917|2009-04-06 22:19:53|     mattycus|@Kenichan I dived...|
+----------+-------------------+-------------+--------------------+
only showing top 2 rows



                                                                                

#### Checking for missing values again after dropping some columns

In [32]:
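from pyspark.sql.functions import col, sum as spark_sum  # alias avoids shadowing Python's built-in sum

# Count the nulls in every column in a single pass over the data
null_count = mysql_df.select([spark_sum(col(c).isNull().cast("int")).alias(c) for c in mysql_df.columns])
null_count.show()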

[Stage 62:>                                                         (0 + 1) / 1]

+-------+----+-----------+----------+
|user_id|date|user_handle|tweet_text|
+-------+----+-----------+----------+
|      0|   0|          0|         0|
+-------+----+-----------+----------+



                                                                                

#### Collect the Dataset to Pandas 
Objective: Gather all rows from the Spark DataFrame (mysql_df) into a list of rows.

Functionality:
`.collect()`: Fetches all data rows from the distributed environment into the driver node as a list of Row objects.

Rationale: This operation is memory-intensive, because every row must fit in the driver's memory, so it is best suited to relatively small datasets. To make it work here, the memory configurations for the executors and driver were raised from 4 GB to 8 GB each, the Spark configuration below was changed at the beginning of the notebook, and unnecessary columns were removed before running the process.

    .config("spark.executor.cores", "4") \
    .config("spark.dynamicAllocation.enabled", "true") \
    .config("spark.dynamicAllocation.minExecutors", "2") \
    .config("spark.dynamicAllocation.maxExecutors", "100") \
    .config("spark.sql.shuffle.partitions", "1000") \

In [33]:
# Collect the Dataset to Pandas
new_df = mysql_df.collect()

                                                                                

#### Create Pandas DataFrame from MySQL
Objective: Convert the list of rows into a Pandas DataFrame for further analysis.

Functionality: pd.DataFrame(new_df): Initializes a Pandas DataFrame from the list of rows.
twitter_df.columns = mysql_df.columns: Assigns the original column names from the Spark DataFrame (mysql_df) to the Pandas DataFrame.

In [34]:
#Create Pandas DataFrame from MySQL

import pandas as pd

twitter_df = pd.DataFrame(new_df)
twitter_df.columns =  mysql_df.columns
twitter_df.head(5)

| | user_id | date | user_handle | tweet_text |
|---|---|---|---|---|
| 0 | 1467810672 | 2009-04-06 22:19:49 | scotthamilton | is upset that he can't update his Facebook by ... |
| 1 | 1467810917 | 2009-04-06 22:19:53 | mattycus | @Kenichan I dived many times for the ball. Man... |
| 2 | 1467811184 | 2009-04-06 22:19:57 | ElleCTF | my whole body feels itchy and like its on fire |
| 3 | 1467811193 | 2009-04-06 22:19:57 | Karoli | @nationwideclass no, it's not behaving at all.... |
| 4 | 1467811372 | 2009-04-06 22:20:00 | joy_wolf | @Kwesidei not the whole crew |


In [35]:
twitter_df.shape

(1599999, 4)

This technique was applied to move the dataset from a Spark DataFrame (`mysql_df`) to a Pandas DataFrame (`twitter_df`) for further analysis. `.collect()` gathers all rows from the Spark DataFrame into a list of Row objects, which is then converted to a Pandas DataFrame. The resulting `twitter_df` has 1,599,999 rows and 4 columns, matching the Spark DataFrame after the two columns were dropped. Because of the scale of the dataset, attention should be paid to memory utilisation: Pandas holds the entire DataFrame in the memory of a single machine, whereas Spark can distribute the data across executors.
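The conversion step itself can be sketched without Spark, since Row objects behave like tuples. The example below passes the column names directly to the constructor, so they do not have to be assigned afterwards; the two sample rows use placeholder tweet text and are only illustrative.

```python
import pandas as pd

# Column names as they appear in mysql_df after the drop
columns = ["user_id", "date", "user_handle", "tweet_text"]

# Plain tuples stand in for the Row objects returned by .collect();
# the tweet text values are placeholders, not real data
rows = [
    (1467810672, "2009-04-06 22:19:49", "scotthamilton", "placeholder text one"),
    (1467810917, "2009-04-06 22:19:53", "mattycus", "placeholder text two"),
]

# Naming the columns at construction time avoids a separate assignment step
sample_df = pd.DataFrame(rows, columns=columns)
print(sample_df.shape)
```

Spark DataFrames also provide a `toPandas()` method that performs the collect and the conversion in one call; it has the same requirement that all rows fit in the driver's memory.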

### Save the Dataset to CSV

In [36]:
# Save the Pandas DataFrame to a CSV file
twitter_df.to_csv('twitter_df.csv', index = False)

In [37]:
!ls -lh

total 401M
-rw-r--r-- 1 hduser hadoopgroup  53K May 11 19:17 'Integrated_CA2MScDA_ BD_ADA -BDS1.ipynb'
-rw-r--r-- 1 hduser hadoopgroup 104K May 11 19:18 'Integrated_CA2MScDA_ BD_ADA .ipynb'
drwxr-xr-x 4 hduser hadoopgroup 4.0K May  6 10:02  MSCCA12023V2
-rw-r--r-- 1 hduser hadoopgroup 2.4M May  5 15:20  mysql-connector-j-8.0.33.jar
-rw-r--r-- 1 hduser hadoopgroup 219M Apr 28 09:50  ProjectTweets.csv
-rw-r--r-- 1 hduser hadoopgroup   32 Apr 25 09:23  README.md
-rw-r--r-- 1 hduser hadoopgroup 180M May 11 19:20  twitter_df.csv


The collected DataFrame (`twitter_df`) is saved to a CSV file named twitter_df.csv via the `to_csv()` method. The `index=False` argument ensures that the index column does not appear in the output. This exports the dataset so that it can be shared, analysed, or preserved in CSV format. The generated CSV file is about 180 MB and contains 1,599,999 rows and four columns (user_id, date, user_handle, and tweet_text), offering a complete and portable snapshot of the Twitter data for further analysis.

The command `!ls -lh` was applied to list all files in the current directory with detailed information such as size, permissions, owner, and timestamps in a human-readable format.

# Pandas

In [29]:
#Importing the libraries.
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from matplotlib import style
from matplotlib import cm
%matplotlib inline

import warnings
warnings.filterwarnings('ignore')

In [30]:
# Display all columns and all rows without truncation
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

In [31]:
# Importing the CSV Dataset 
twitter_df = pd.read_csv('twitter_df.csv') 
twitter_df.head() 

| | user_id | date | user_handle | tweet_text |
|---|---|---|---|---|
| 0 | 1467810672 | 2009-04-06 22:19:49 | scotthamilton | is upset that he can't update his Facebook by ... |
| 1 | 1467810917 | 2009-04-06 22:19:53 | mattycus | @Kenichan I dived many times for the ball. Man... |
| 2 | 1467811184 | 2009-04-06 22:19:57 | ElleCTF | my whole body feels itchy and like its on fire |
| 3 | 1467811193 | 2009-04-06 22:19:57 | Karoli | @nationwideclass no, it's not behaving at all.... |
| 4 | 1467811372 | 2009-04-06 22:20:00 | joy_wolf | @Kwesidei not the whole crew |


In [32]:
twitter_df.shape

(1599999, 4)

In [33]:
twitter_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1599999 entries, 0 to 1599998
Data columns (total 4 columns):
 #   Column       Non-Null Count    Dtype 
---  ------       --------------    ----- 
 0   user_id      1599999 non-null  int64 
 1   date         1599999 non-null  object
 2   user_handle  1599999 non-null  object
 3   tweet_text   1599999 non-null  object
dtypes: int64(1), object(3)
memory usage: 48.8+ MB


Based on these first basic details, the dataset appears to be a collection of tweets, of the kind commonly used in sentiment analysis or other natural language processing tasks.

Dataset Size and Integrity:
The dataset contains approximately 1.6 million entries (1599999 entries, indexed from 0 to 1599998) and 4 columns.
Each column has the same number of non-null entries (1599999), indicating there are no missing values in any of the columns.
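Memory Usage: The DataFrame reports approximately 48.8+ MB of memory. The `+` indicates a lower bound, because the contents of the three object (string) columns are not measured in depth, so the actual footprint is larger. This is relevant for understanding the computational resources needed to process the dataset.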
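
The difference between the shallow figure reported by `info()` and the deep figure can be sketched as follows. The two-row `sample_df` is an assumption for illustration; on the full dataset the same calls would be made on `twitter_df`.

```python
import pandas as pd

# Small stand-in for twitter_df (two rows taken from the preview above)
sample_df = pd.DataFrame({
    "user_id": [1467810672, 1467810917],
    "tweet_text": ["my whole body feels itchy", "not the whole crew"],
})

# Shallow measurement: what info() reports by default
shallow_bytes = sample_df.memory_usage(deep=False).sum()

# Deep measurement: also accounts for the string contents
deep_bytes = sample_df.memory_usage(deep=True).sum()

print(f"shallow: {shallow_bytes} bytes, deep: {deep_bytes} bytes")
```

Calling `twitter_df.info(memory_usage='deep')` would report the deep figure directly, at the cost of a slower computation on 1.6 million rows.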

In [34]:
twitter_df.dtypes

user_id         int64
date           object
user_handle    object
tweet_text     object
dtype: object

In [35]:
print('length of data is', len(twitter_df))

length of data is 1599999


In [36]:
twitter_df.isnull().values.any()

False

#### Display the Unique Values

In [37]:
for column in twitter_df.columns:
    num_unique_values = twitter_df[column].nunique()
    print(f"Number of unique values in '{column}': {num_unique_values}")

Number of unique values in 'user_id': 1598314
Number of unique values in 'date': 774362
Number of unique values in 'user_handle': 659775
Number of unique values in 'tweet_text': 1581465


In [38]:
def return_unique_values(data_frame):
    unique_dataframe = pd.DataFrame()
    unique_dataframe['Features'] = data_frame.columns
    uniques = []
    for col in data_frame.columns:
        u = data_frame[col].nunique()
        uniques.append(u)
    unique_dataframe['Uniques'] = uniques
    return unique_dataframe

unidf = return_unique_values(twitter_df)
print(unidf)

      Features  Uniques
0      user_id  1598314
1         date   774362
2  user_handle   659775
3   tweet_text  1581465


### Data Exploration (EDA) & Data Preparation

#### DateTime Parsing - Convert Date Format
The `date` column stores date and time information as an object (string), so it is converted to a DateTime type for easier manipulation and to support time series operations.

Because the `date` values follow the format 2009-04-06 22:19:49, the conversion specifies the format explicitly (%Y-%m-%d %H:%M:%S) and includes error handling. Defining the format lets Pandas interpret each part of the datetime string directly, which avoids ambiguity and improves parsing efficiency. With errors='coerce', values that do not fit the required format are converted to NaT instead of raising an exception. This preserves processing continuity and makes data quality problems easy to identify. The technique is particularly helpful when preparing data for accurate and trustworthy time-series analysis.
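
The behaviour of `errors='coerce'` can be illustrated on a few values. This is a minimal sketch; the `raw_dates` series, including the deliberately invalid entry, is made up for the example.

```python
import pandas as pd

# Two valid timestamps and one value that does not match the format
raw_dates = pd.Series(["2009-04-06 22:19:49", "not a date", "2009-06-25 10:28:31"])

# Invalid entries become NaT instead of raising an exception
parsed = pd.to_datetime(raw_dates, format="%Y-%m-%d %H:%M:%S", errors="coerce")

# Counting NaT values shows how many rows failed to parse
n_failed = int(parsed.isna().sum())
print(parsed)
print("unparseable rows:", n_failed)
```

Checking the NaT count after the conversion is a quick way to confirm that every row matched the expected format.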

In [39]:
# Convert the 'date' column to datetime specifying the exact format
twitter_df['date'] = pd.to_datetime(twitter_df['date'], format='%Y-%m-%d %H:%M:%S', errors='coerce')

# Check the first few entries to confirm the change
twitter_df.head(5)

Unnamed: 0,user_id,date,user_handle,tweet_text
0,1467810672,2009-04-06 22:19:49,scotthamilton,is upset that he can't update his Facebook by ...
1,1467810917,2009-04-06 22:19:53,mattycus,@Kenichan I dived many times for the ball. Man...
2,1467811184,2009-04-06 22:19:57,ElleCTF,my whole body feels itchy and like its on fire
3,1467811193,2009-04-06 22:19:57,Karoli,"@nationwideclass no, it's not behaving at all...."
4,1467811372,2009-04-06 22:20:00,joy_wolf,@Kwesidei not the whole crew


In [40]:
# Check the data types to confirm the change
print(twitter_df['date'].dtype)

datetime64[ns]


In [41]:
twitter_df.isnull().values.any()

False

In [42]:
twitter_df.describe()

Unnamed: 0,user_id,date
count,1599999.0,1599999
mean,1998818000.0,2009-05-31 07:26:27.994492416
min,1467811000.0,2009-04-06 22:19:49
25%,1956916000.0,2009-05-28 23:01:17.500000
50%,2002102000.0,2009-06-02 03:08:55
75%,2177059000.0,2009-06-15 05:21:43.500000
max,2329206000.0,2009-06-25 10:28:31
std,193575700.0,


The summary statistics for the user_id and date columns describe the range of identifiers and the timing of the data collection period. The user_id has a mean of approximately 1.999 billion, indicating a wide range of values, potentially due to a big user base or a wide encoding range for user identification. The difference between the minimum (1.467811e+09) and maximum (2.329206e+09) values, with a standard deviation of around 193.6 million, indicates significant variability in ID assignments, which could be due to different periods of registration or system changes affecting ID allocation. Because user_id is an identifier rather than a measurement, its mean and standard deviation are descriptive only. The date column ranges from April 6, 2009 to June 25, 2009, with the median occurring on June 2, 2009.

The dataset covers almost three months, from late spring to early summer, so it may have captured seasonal user behaviour or events. The 25th and 75th percentiles fall on May 28 and June 15 respectively, which means that half of all tweets were recorded within that window of under three weeks. This concentration may correspond with particular events or promotions that encouraged user participation during these times.

#### Histogram of Text Length and Word Count

In [None]:
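# Compute the two features first so that this cell can run on its own
twitter_df['text_length'] = twitter_df['tweet_text'].astype(str).str.len()
twitter_df['word_count'] = twitter_df['tweet_text'].astype(str).str.split().str.len()

twitter_df['text_length'].plot(kind='hist', title='Distribution of Tweet Text Length', bins=20, alpha=0.7)
plt.xlabel('Text Length')
plt.show()

twitter_df['word_count'].plot(kind='hist', title='Distribution of Tweet Word Count', bins=20, alpha=0.7)
plt.xlabel('Word Count')
plt.show()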

In [None]:
import pandas as pd
import matplotlib.pyplot as plt

# Convert a 1,000-row sample of the Spark DataFrame to a Pandas DataFrame
tweet_data = jdbc_df.limit(1000).toPandas()

# Example visualization: histogram of tweet length in characters
tweet_data['tweet_length'] = tweet_data['tweet_text'].apply(len)
tweet_data['tweet_length'].hist(bins=20)
plt.xlabel("Tweet Length")
plt.show()

#### Distributions
Explore the distribution of key variables using histograms or box plots to understand their central tendencies and spread.

#### Histograms: Useful for visualizing the distribution of numerical data.

In [None]:
user_ids = twitter_df['user_id']

# Calculate histogram data
counts, bin_edges = np.histogram(user_ids, bins=10)

# Create a histogram
plt.figure(figsize=(10, 6))  # Make it large enough for clarity
plt.bar(bin_edges[:-1], counts, width=np.diff(bin_edges), edgecolor='black', align='edge', color='skyblue')
plt.title('Distribution of User IDs')
plt.xlabel('User ID Bins')
plt.ylabel('Frequency')
plt.grid(True)  

# Remove all unnecessary tick marks and frame pieces
plt.tick_params(top=False, right=False, left=True, bottom=True, labelleft=True, labelbottom=True)
for spine in plt.gca().spines.values():
    if spine.spine_type not in ['bottom', 'left']:  
        spine.set_visible(False)

plt.show()

#### Box Plots: Good for detecting outliers and understanding the distribution's quartiles.

In [None]:
import matplotlib.pyplot as plt

# Create a box plot
plt.figure(figsize=(8, 6)) 
plt.boxplot(user_ids, vert=False, widths=0.7, patch_artist=True, flierprops={'marker':'o', 'color':'red', 'markersize':5})
plt.title('User ID Distribution')
plt.xlabel('User IDs')
plt.yticks([])  

# Enhance data-ink ratio
plt.gca().spines['top'].set_visible(False)
plt.gca().spines['right'].set_visible(False)
plt.gca().spines['left'].set_visible(False)  # Remove left spine if y-axis ticks are not informative
plt.grid(True, linestyle='--', which='major', color='gray', alpha=0.5)

plt.show()

#### Plot histograms and density plots of user_id to understand user engagement and presence. 
The histogram helps to visualise how active users are in terms of the number of tweets they post, while the density plot gives a clear view of the distribution's shape, highlighting typical activity levels without the binning that a histogram requires.

First, count how often each user_id appears, so that the frequency of user interactions can be visualised.
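
The unique-value counts above show 1,598,314 distinct user_id values in 1,599,999 rows, so almost every user_id appears only once. A minimal sketch of an alternative, assuming that `user_handle` identifies the account (an assumption, not something the notebook states), counts tweets per handle instead. The `sample_df` values are made up.

```python
import pandas as pd

# Made-up rows: three tweets by one handle, one tweet each by two others
sample_df = pd.DataFrame({
    "user_id": [101, 102, 103, 104, 105],
    "user_handle": ["alice", "bob", "alice", "carol", "alice"],
})

# Number of tweets posted by each handle
tweets_per_user = sample_df["user_handle"].value_counts()

# How many handles posted 1 tweet, 2 tweets, 3 tweets, ...
activity_distribution = tweets_per_user.value_counts().sort_index()

print(tweets_per_user)
print(activity_distribution)
```

The second series is the quantity a histogram of user activity summarises: the number of users at each activity level.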

In [None]:
# Count the frequency of each user_id
user_activity = twitter_df['user_id'].value_counts()

import matplotlib.pyplot as plt

# Create a histogram of user activity
plt.figure(figsize=(10, 6))
plt.hist(user_activity.values, bins=30, color='skyblue', edgecolor='black')
plt.title('Histogram of User Activity')
plt.xlabel('Number of Tweets per User')
plt.ylabel('Frequency')
plt.grid(True, which='both', linestyle='--', linewidth=0.5, alpha=0.7)  # Minimal grid use
plt.show()

In [None]:
import seaborn as sns

# Create a density plot
plt.figure(figsize=(10, 6))
sns.kdeplot(user_activity.values, fill=True, color='blue', alpha=0.7)  # 'fill' replaces the deprecated 'shade' argument
plt.title('Density Plot of User Activity')
plt.xlabel('Number of Tweets per User')
plt.ylabel('Density')
plt.show()

#### Pair Plots
Use pair plots to visualise the relationships and distributions of each pair of variables in the data.

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

# Create a pair plot with a clean and minimalist design
sns.set(style="white")  # Sets the style of the plot to a simple white background
pairplot = sns.pairplot(twitter_df, diag_kind='kde', corner=True,
                        plot_kws={'alpha': 0.6, 's': 80, 'edgecolor': 'k'})

# Label the diagonal (KDE) panels; with corner=True the upper-triangle axes are None
for i, row in enumerate(pairplot.axes):
    for j, ax in enumerate(row):
        if ax is not None and i == j:
            ax.set_ylabel('Density')

# Add titles and labels
plt.subplots_adjust(top=0.9)
pairplot.fig.suptitle('Pairwise Plots of User Metrics', fontsize=16)

plt.show()

#### Correlation Matrix
Investigate the correlations between numerical variables to better understand the links between the fields in the dataset.
Creating a correlation matrix is an excellent way to visually and quantitatively investigate the correlations between numerical variables in your dataset.

In [None]:
import pandas as pd

# Compute the correlation matrix using the numeric columns only
corr_matrix = twitter_df.corr(numeric_only=True)

# Print the correlation matrix
print(corr_matrix)

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

# Create a heatmap from the correlation matrix
plt.figure(figsize=(10, 8))
sns.heatmap(corr_matrix, annot=True, fmt=".2f", cmap='coolwarm', cbar=True, square=True, linewidths=.5)

# Enhancing the heatmap
plt.title('Correlation Matrix of Variables')
plt.yticks(rotation=0)  # Rotate y-labels for better readability
plt.xticks(rotation=90)  # Rotate x-labels for better readability
plt.show()

### Text Data Analysis
To analyse the tweet texts, the number of words in each tweet is counted.

Before `.split()` is called, x (the tweet text) is explicitly converted to a string with `str(x)`. This conversion guards against errors that could occur if non-string data were unintentionally included in tweet_text: it guarantees that `.split()`, which splits the text into words on whitespace, always has a string to work on.
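
A small sketch of this guard, using a hypothetical helper named `count_words` (the name is not from the notebook), shows what happens with string and non-string inputs.

```python
def count_words(value):
    """Count whitespace-separated tokens, converting non-strings to text first."""
    return len(str(value).split())

# A regular tweet is split on whitespace
print(count_words("@Kwesidei not the whole crew"))

# Non-string inputs no longer raise an error, but they count as one "word"
print(count_words(None))
print(count_words(3.14))
```

The conversion avoids an exception, but a missing value is counted as a single word, so a separate null check remains worthwhile.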

In [63]:
# Calculate the length of each tweet and the word count
twitter_df['text_length'] = twitter_df['tweet_text'].apply(lambda x: len(str(x)))
twitter_df['word_count'] = twitter_df['tweet_text'].apply(lambda x: len(str(x).split()))
twitter_df.head(5)

Unnamed: 0,user_id,date,user_handle,tweet_text,text_length,word_count,sentiment,clean_tweet_text
0,1467810672,2009-04-06 22:19:49,scotthamilton,is upset that he can't update his Facebook by ...,111,21,0.0,is upset that he cant update his facebook by t...
1,1467810917,2009-04-06 22:19:53,mattycus,@Kenichan I dived many times for the ball. Man...,89,18,0.5,i dived many times for the ball managed to sav...
2,1467811184,2009-04-06 22:19:57,ElleCTF,my whole body feels itchy and like its on fire,47,10,0.2,my whole body feels itchy and like its on fire
3,1467811193,2009-04-06 22:19:57,Karoli,"@nationwideclass no, it's not behaving at all....",111,21,-0.625,no its not behaving at all im mad why am i her...
4,1467811372,2009-04-06 22:20:00,joy_wolf,@Kwesidei not the whole crew,29,5,0.2,not the whole crew


The text_length and word_count columns hold the character count and the total word count of each tweet. An example of a moderately long tweet is the one by user "scotthamilton" in the first row, which has 111 characters and 21 words. The tweet from "joy_wolf" in the last row, on the other hand, is substantially shorter, with just 29 characters and 5 words. These metrics are useful for examining tweet verbosity and content density. Applications such as sentiment analysis, user engagement research, and social media language usage modelling can benefit from this kind of data.

### Sentiment Analysis

In [64]:
#!pip install textblob

#### Text Cleaning
For sentiment analysis, text needs to be cleaned for better results.

In [68]:
import re

def clean_text(text):
    text = re.sub(r'@\w+', '', text)  # Remove mentions
    text = re.sub(r'http\S+', '', text)  # Remove URLs
    text = re.sub(r'[^a-zA-Z\s]', '', text)  # Remove punctuation and numbers
    text = text.lower().strip()  # Convert to lower case and strip whitespaces
    return text

twitter_df['clean_tweet_text'] = twitter_df['tweet_text'].apply(clean_text)
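
A quick check of the cleaning steps on a single made-up tweet can be sketched as follows. The `clean_text` function repeats the steps of the cell above; `collapse_whitespace` is an optional extra step that is not part of the original pipeline, and the sample tweet is hypothetical.

```python
import re

def clean_text(text):
    # Same steps as the notebook cell above
    text = re.sub(r'@\w+', '', text)         # Remove mentions
    text = re.sub(r'http\S+', '', text)      # Remove URLs
    text = re.sub(r'[^a-zA-Z\s]', '', text)  # Remove punctuation and numbers
    return text.lower().strip()

def collapse_whitespace(text):
    # Optional extra step: squeeze runs of whitespace into one space
    return re.sub(r'\s+', ' ', text).strip()

sample = "@user Check this out http://example.com It's GREAT!!! 100%"
cleaned = clean_text(sample)

print(repr(cleaned))
print(repr(collapse_whitespace(cleaned)))
```

Removing mentions and URLs leaves gaps behind, so the cleaned text can contain double spaces; collapsing them is cosmetic for word-based scoring but keeps the stored text tidy.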

#### TextBlob to add sentiment scores to the DataFrame.

In [69]:
# Use the TextBlob library
from textblob import TextBlob

# Function to get the polarity score
def get_sentiment(text):
    return TextBlob(text).sentiment.polarity

twitter_df['sentiment'] = twitter_df['clean_tweet_text'].apply(get_sentiment)

# Display the columns 'tweet_text', 'sentiment', and 'clean_tweet_text' 
twitter_df[['tweet_text', 'sentiment', 'clean_tweet_text']].head(5)

Unnamed: 0,tweet_text,sentiment,clean_tweet_text
0,is upset that he can't update his Facebook by ...,0.0,is upset that he cant update his facebook by t...
1,@Kenichan I dived many times for the ball. Man...,0.5,i dived many times for the ball managed to sav...
2,my whole body feels itchy and like its on fire,0.2,my whole body feels itchy and like its on fire
3,"@nationwideclass no, it's not behaving at all....",-0.625,no its not behaving at all im mad why am i her...
4,@Kwesidei not the whole crew,0.2,not the whole crew


The code cleans the text before scoring it, removing mentions, URLs, punctuation, and digits that carry no lexical sentiment. TextBlob polarity ranges from -1 (most negative) to 1 (most positive); in the five sample tweets it runs from -0.625 to 0.5, so the scores do separate different emotional tones based on lexical content. The first tweet, however, receives a polarity of 0.0 even though it begins with "is upset", which shows that a lexicon-based score can miss sentiment. This preprocessing and analysis pipeline serves as a foundation for future sentiment-based applications such as trend analysis, customer feedback, and long-term monitoring of public opinion.

## Machine Learning

### VADER (Valence Aware Dictionary and sEntiment Reasoner)
VADER (Valence Aware Dictionary and sEntiment Reasoner) is used for sentiment analysis. It was chosen because it is well suited to social media text, owing to its sensitivity to both polarity and emotional intensity.

#### Install Necessary Package
First, install the nltk library, which includes the VADER sentiment analysis tool. 

In [77]:
#!pip install nltk
#!pip install nltk scikit-learn

#### Import Libraries and Load VADER
Import necessary libraries and download the VADER lexicon.

In [72]:
import pandas as pd

from nltk.sentiment import SentimentIntensityAnalyzer
import nltk
nltk.download('vader_lexicon')

# Initialize VADER
sia = SentimentIntensityAnalyzer()

[nltk_data] Downloading package vader_lexicon to
[nltk_data]     C:\Users\rosil\AppData\Roaming\nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!


#### Define a Function to Get Sentiment Scores and Apply the Sentiment Function

In [75]:
# Create a function that uses VADER to compute the sentiment scores.
def get_vader_sentiment(text):
    
    # VADER outputs a dictionary of scores; 'compound' gives the normalized sentiment score
    return sia.polarity_scores(text)['compound']

# Use the function to calculate sentiment scores for each tweet.
twitter_df['vader_sentiment'] = twitter_df['tweet_text'].apply(get_vader_sentiment)

# Display the first few rows to check the sentiment scores results
print(twitter_df[['tweet_text', 'vader_sentiment']].head())

                                          tweet_text  vader_sentiment
0  is upset that he can't update his Facebook by ...          -0.7500
1  @Kenichan I dived many times for the ball. Man...           0.4939
2    my whole body feels itchy and like its on fire           -0.2500
3  @nationwideclass no, it's not behaving at all....          -0.6597
4                      @Kwesidei not the whole crew            0.0000
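
The compound scores above can be turned into coarse labels. This is a minimal sketch, assuming the commonly used cut-offs of +0.05 and -0.05 for the compound score; the helper name `label_compound` and the `threshold` parameter are hypothetical and not part of the notebook.

```python
def label_compound(score, threshold=0.05):
    """Map a VADER compound score to a coarse sentiment label."""
    if score >= threshold:
        return "positive"
    if score <= -threshold:
        return "negative"
    return "neutral"

# Compound scores copied from the output above
sample_scores = [-0.7500, 0.4939, -0.2500, -0.6597, 0.0000]
labels = [label_compound(score) for score in sample_scores]
print(labels)
```

On the full DataFrame the same function could be applied with `twitter_df['vader_sentiment'].apply(label_compound)`.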


### Time Series Analysis and Forecasting

#### Set Date as Index: For time series analysis, it's useful to have the date as the DataFrame index

In [None]:
# Build a date-indexed, chronologically sorted copy; twitter_df keeps its 'date' column
tweets_df = twitter_df.set_index('date').sort_index()
tweets_df.head(2)

#### Aggregate Sentiment Over Time: 
Resample the sentiment data to a daily frequency and calculate the mean sentiment.

In [None]:
daily_sentiment = tweets_df['sentiment'].resample('D').mean()
daily_sentiment.plot(title='Daily Sentiment Score')
plt.xlabel('Date')
plt.ylabel('Sentiment Score')
plt.show()

#### Trend Analysis: 
Plot a time series of the number of tweets per day to see whether there is any visible trend or seasonality.

In [None]:
tweets_df.resample('D').size().plot(title='Daily Tweets')
plt.xlabel('Date')
plt.ylabel('Number of tweets')
plt.show()

### Seasonality Check: 
Use seasonal decomposition to observe inherent seasonality in the data.

In [None]:
!pip install statsmodels

In [None]:
from statsmodels.tsa.seasonal import seasonal_decompose
result = seasonal_decompose(tweets_df.resample('D').size(), model='additive')
result.plot()
plt.show()

### Correlation and Causation Analysis

#### Stationarity Check: 
Ensure the time series data is stationary, as this is a requirement for models like ARIMA.

In [None]:
from statsmodels.tsa.stattools import adfuller
test_result = adfuller(daily_sentiment.dropna())
print('ADF Statistic: %f' % test_result[0])
print('p-value: %f' % test_result[1])
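
Reading the ADF result can be sketched as follows. The helper `is_stationary`, the 0.05 significance level, and the made-up `trend_series` are assumptions for illustration; the p-values passed in are placeholders, not results from this dataset.

```python
import pandas as pd

def is_stationary(p_value, alpha=0.05):
    """Reject the unit-root null hypothesis when the p-value is below alpha."""
    return p_value < alpha

# Hypothetical series with an upward trend (values are made up)
trend_series = pd.Series([1.0, 2.0, 4.0, 7.0, 11.0])

# First-order differencing is a common first remedy for a non-stationary series
differenced = trend_series.diff().dropna()

print(differenced.tolist())
print(is_stationary(0.30), is_stationary(0.01))
```

If the test on `daily_sentiment` does not reject the null hypothesis, the differenced series can be tested again before fitting a model such as ARIMA.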

#### Time Series Plot of Tweet Frequency

In [None]:
import matplotlib.pyplot as plt

# Plotting the number of tweets over time
twitter_df.set_index('date').resample('S').size().plot()
plt.title('Tweet Frequency Over Time')
plt.xlabel('Time')
plt.ylabel('Number of Tweets')
plt.show()

### Time Series Preparation
Ensure the DataFrame is sorted by date and set the date as an index for time series analysis.

In [None]:
twitter_df = twitter_df.sort_values(by='date')
twitter_df.set_index('date', inplace=True)
print(twitter_df.head())

### Forecasting:
Implement time series forecasting models such as ARIMA or LSTM
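
Before fitting ARIMA or an LSTM, it can help to have a baseline to compare against. The sketch below is neither ARIMA nor LSTM: it is a naive last-value forecast, evaluated with mean absolute error on a chronological hold-out. The `daily_counts` series and its values are hypothetical.

```python
import pandas as pd

# Hypothetical daily tweet counts (made-up values)
daily_counts = pd.Series(
    [100, 120, 90, 110, 130, 95, 105, 125],
    index=pd.date_range("2009-04-06", periods=8, freq="D"),
)

# Chronological split: the last two days are held out, with no shuffling
train, test = daily_counts.iloc[:6], daily_counts.iloc[6:]

# Naive forecast: repeat the last observed training value
naive_forecast = pd.Series(train.iloc[-1], index=test.index)

# Mean absolute error of the baseline on the hold-out period
mae = (test - naive_forecast).abs().mean()
print(f"naive baseline MAE: {mae:.1f}")
```

A forecasting model is only worth its added complexity if it achieves a lower error than this baseline on the same hold-out period.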

### Data Visualization and Reporting
Visualizing aspects of the data can further aid in understanding the distribution and relationships. 

Dynamic Dashboard Creation:
Use an appropriate tool such as Plotly and Dash to create an interactive dashboard. Prepare the data in the notebook and export it for visualization.
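
The preparation step could look like the following minimal sketch, which aggregates tweets to one row per day and exports the result as CSV. The sample rows, the column names `tweet_count` and `mean_sentiment`, and the use of an in-memory buffer instead of a file are assumptions for illustration.

```python
import io
import pandas as pd

# Made-up rows standing in for twitter_df with its VADER scores
sample = pd.DataFrame({
    "date": pd.to_datetime([
        "2009-04-06 22:19:49", "2009-04-06 23:10:00", "2009-04-07 09:00:00",
    ]),
    "vader_sentiment": [-0.75, 0.49, 0.0],
})

# One row per day: number of tweets and mean sentiment
daily_summary = (
    sample.set_index("date")["vader_sentiment"]
    .resample("D")
    .agg(["count", "mean"])
    .rename(columns={"count": "tweet_count", "mean": "mean_sentiment"})
    .reset_index()
)

# Export for the dashboard; a real run would pass a file path instead of a buffer
buffer = io.StringIO()
daily_summary.to_csv(buffer, index=False)
print(buffer.getvalue())
```

A compact daily table like this keeps the dashboard responsive, because it avoids loading all 1.6 million tweets.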

Documentation:
Document the findings, methods, and justifications in the report according to the project guidelines.
