                                                                       TEMUULEN Bulgan - 2022427

### CCT College Dublin Continuous Assessment No.2
# AN ANALYSIS OF INDIAN FARMERS' PROTEST TWEETS.

**Brief Introduction of the project:**

1. For my second continuous assessment, I chose a CSV dataset of Indian farmers' protest tweets. It contains over 1 million English-language tweets posted between November 1st, 2020 and November 21st, 2021 with the hashtag #FarmersProtest. It was downloaded from the Kaggle website under the CC0: Public Domain license.
(https://kaggle.com/datasets/prathamsharma123/farmers-protest-tweets-dataset-csv)

2. I divided my project into 3 primary sections (every step of the data processing and analysis is discussed in the subsections of these primary sections):
    1. big data storage and processing
    2. comparative analysis of databases
    3. sentiment analysis and forecast

3. I used Git for daily code tracking and GitHub for archiving, monitoring and sharing. To view the whole project on GitHub, click on the following link

**Libraries and modules used for this project:**

In [1]:
import pandas as pd
import os
import re
from pyspark.sql import SparkSession
from pyspark.sql.functions import *
from pyspark.sql.types import *

## 1. Big Data Storage and Processing

For my second continuous assessment I chose the Kaggle dataset 'Farmers Protest Tweets'. The tweets were collected using the hashtag #FarmersProtest and are published under the CC0: Public Domain license, which allows anyone to copy, modify, distribute and perform the work, even for commercial purposes, without asking permission. All the tweets are in English and were collected from Twitter.com between November 1st, 2020 and November 21st, 2021. Their main subject is the large protest against the farm laws, which took place at the borders of the Indian national capital, New Delhi, and was organized by a coalition of over 40 farmer unions from across the country. The dataset was extracted by Pratham Sharma, a Kaggle Datasets Expert, who used the Twitter API and the snscrape Python library for collection. The tweet data consists of two separate CSV files of 1.7 GB and 81.2 MB.

In [None]:
# Checking the file size of the file.
file_size = os.path.getsize('/home/hduser/Desktop/ca/tweets.csv')/(1024*1024*1024)
print(f'The size of the tweets file is: {file_size:.2f} GB.')
file_size = os.path.getsize('/home/hduser/Desktop/ca/users.csv')/(1024*1024)
print(f'The size of the users file is: {file_size:.2f} MB.')

### 1.1. *Preprocessing in Python.*

Before processing the dataset on a distributed file system platform, I decided to examine each CSV file using Pandas in a Jupyter Notebook.

***tweets.csv***

In [None]:
# Creating a DataFrame.
df_tweets = pd.read_csv('/home/hduser/Desktop/ca/tweets.csv')

# Checking the DataFrame.
df_tweets.head(2)

In [None]:
# Printing information about the DataFrame
df_tweets.info()

***users.csv***

In [None]:
# Creating a DataFrame.
df_users = pd.read_csv('/home/hduser/Desktop/ca/users.csv')

# Checking the DataFrame.
df_users.head(2)

In [None]:
# Printing information about the DataFrame
df_users.info()

As we can see from the four code cells above, the tweets dataset consists of 13 distinct columns, some of which are not important for further processing. Columns such as 'tweetUrl', 'tweetId', 'source', 'media', 'retweetedTweet', 'quotedTweet', and 'mentionedUsers' don't contain important information for the analysis, and 'userId' is needed only as the key for merging.

On the other hand, the users dataset consists of 18 distinct columns, of which I can use only 'displayname'. So next I'm going to remove the unnecessary columns from each DataFrame and merge them into one.

In [None]:
# Deleting columns from the DataFrame of tweets.
df_tweets = df_tweets.drop(labels=['tweetUrl', 'tweetId', 'source', 'media', 'retweetedTweet', 'quotedTweet', \
                                           'mentionedUsers'], axis=1)

# Checking the changes.
df_tweets.head(2)

In [None]:
# Keeping only the columns 'displayname' and 'userId' in the DataFrame of users.
df_users = df_users[['displayname', 'userId']]

# Checking the changes.
df_users.head(2)
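Since only two of the user columns are kept, the selection could also be made at read time, so the unused columns never reach memory. A minimal sketch, assuming a small in-memory CSV stands in for 'users.csv'; the columns 'followersCount' and 'location' and the sample rows are hypothetical placeholders:

```python
import io
import pandas as pd

# Hypothetical stand-in for users.csv; only the header names 'userId' and
# 'displayname' come from the notebook, the rest is made up for illustration.
sample_csv = io.StringIO(
    "userId,displayname,followersCount,location\n"
    "1,Sukhdev Singh,120,Punjab\n"
    "2,Rajay Deep,45,Delhi\n"
)

# usecols tells read_csv to parse and keep only the listed columns.
df_users_small = pd.read_csv(sample_csv, usecols=['userId', 'displayname'])

print(df_users_small.columns.tolist())
print(df_users_small.shape)
```

The same `usecols` argument could be passed when reading the real files, which may help when a 1.7 GB file does not fit comfortably in the VM's memory.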

In [None]:
# Performing left merging on two DataFrames.
df_final = pd.merge(df_tweets, df_users, on='userId', how='left')

# Deleting the column 'userId'.
df_final = df_final.drop(labels=['userId'], axis=1)

# Checking the changes.
df_final.head(10)
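A left merge keeps every tweet, so tweets whose 'userId' has no match in the users table end up with a missing 'displayname'. The toy sketch below illustrates this; the two small DataFrames are hypothetical, and `indicator=True` and `validate='many_to_one'` are optional checks that the merge cell above does not use:

```python
import pandas as pd

# Hypothetical miniature versions of the two DataFrames.
tweets = pd.DataFrame({'userId': [1, 2, 3],
                       'renderedContent': ['first', 'second', 'third']})
users = pd.DataFrame({'userId': [1, 2],
                      'displayname': ['Sukhdev Singh', 'Rajay Deep']})

# indicator=True adds a '_merge' column that says where each row came from;
# validate='many_to_one' raises an error if a userId repeats in `users`.
merged = pd.merge(tweets, users, on='userId', how='left',
                  indicator=True, validate='many_to_one')

# Rows marked 'left_only' are tweets without a matching user.
unmatched = merged[merged['_merge'] == 'left_only']
print(len(unmatched), 'tweet(s) without a display name')
```

Counting the 'left_only' rows before saving gives a quick estimate of how many tweets will later have no display name.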

I merged the two DataFrames and removed the unnecessary columns. Now I'm going to save the result as a CSV file on my Ubuntu VM.

In [None]:
# Saving the DataFrame as 'new_tweets.csv' on my VM with the utf-8 Encoding.
df_final.to_csv('/home/hduser/Desktop/ca/new_tweets.csv', encoding='utf-8', index=False)

In [None]:
# Checking the file size.
new_file_size = os.path.getsize('/home/hduser/Desktop/ca/new_tweets.csv')/(1024*1024)
print(f'The size of new tweet file is: {new_file_size:.2f} MB.')

# Checking the shape ('shape' is an attribute, not a method).
print(df_final.shape)

The file size was reduced from 1.7 GB to 670.2 MB, and the file still contains the important tweet information for further processing.

### 1.2. *Data cleaning in PySpark.*

I decided to store my newly created Twitter dataset file in HDFS. Before doing EDA and sentiment analysis, I'll do more thorough preprocessing using Apache Spark tools.

In [2]:
# Starting a new SparkSession for data import from HDFS.
spark = SparkSession.builder \
        .appName('HDFS Data Import') \
        .getOrCreate()

In [3]:
# Reading the file.
df_spark = spark.read.option('header', 'true') \
                        .option('multiline', 'true') \
                        .option('quote', "\"") \
                        .option('escape', "\"") \
                        .csv('/ca2/new_tweets.csv')

In [4]:
# Checking the contents of the DataFrame.
df_spark.show(100)

+--------------------+--------------------+----------+------------+---------+----------+--------------------+
|                date|     renderedContent|replyCount|retweetCount|likeCount|quoteCount|         displayname|
+--------------------+--------------------+----------+------------+---------+----------+--------------------+
|2021-03-30 03:33:...|Support 👇\n\n#Fa...|         0|           0|        0|         0|                null|
|2021-03-30 03:33:...|Supporting farmer...|         0|           0|        0|         0|                null|
|2021-03-30 03:31:...|Support farmers i...|         0|           0|        0|         0|                null|
|2021-03-30 03:30:...|#StopHateAgainstF...|         0|           1|        3|         0|       Sukhdev Singh|
|2021-03-30 03:30:...|You hate farmers ...|         0|           0|        1|         0|                null|
|2021-03-30 03:29:...|They can't be far...|         0|           0|        0|         0|   Abhimanyu 🌏 🇮🇳|
|2021-03-30 03

Using 'SparkSession.builder' I created a new SparkSession for interacting with Spark functionality. Then I created a new PySpark DataFrame by importing the 'new_tweets.csv' file which is stored in HDFS's 'ca2' directory. 
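The 'multiline', 'quote' and 'escape' options matter because tweet text can contain line breaks and doubled quotation marks inside a quoted CSV field. The sketch below shows the same issue with Python's standard `csv` module instead of Spark; the sample record is made up for illustration:

```python
import csv
import io

# One header line plus ONE record whose text field spans several lines
# and contains a doubled quote ("") as an escaped quotation mark.
raw = (
    'date,renderedContent,likeCount\n'
    '2021-03-30,"Support farmers\n\nHe said ""stay strong""",3\n'
)

# Splitting on line breaks miscounts the records.
naive_lines = raw.splitlines()

# A quote-aware parser keeps the multi-line field together.
rows = list(csv.reader(io.StringIO(raw)))

print(len(naive_lines), 'physical lines')
print(len(rows), 'parsed rows (header + 1 record)')
print(repr(rows[1][1]))
```

A reader that treats every physical line as a record would split such a tweet into several broken rows, which is the situation the Spark options are meant to avoid.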

Before I start examining and cleaning the newly created DataFrame, I would like to remove duplicate rows and rows with null values. As can be seen from the above cell, there are some null values in the merged column 'displayname'. It seems that some users who tweeted don't have a display name. This might indicate that they are no longer users of the platform: the account might have been blocked for trolling, or deleted for some other reason. So, to prevent bias and to avoid wasting memory on extra processing, I have decided to drop the rows that don't contain a display name or that are duplicate tweet entries.

In [5]:
# Dropping the duplicates.
df_spark = df_spark.drop_duplicates()

# Dropping rows with null values in the column 'displayname'.
df_spark = df_spark.filter(col('displayname').isNotNull())

In [12]:
# Checking the contents of the DataFrame.
df_spark.show(100)



+--------------------+--------------------+----------+------------+---------+----------+--------------------+
|                date|     renderedContent|replyCount|retweetCount|likeCount|quoteCount|         displayname|
+--------------------+--------------------+----------+------------+---------+----------+--------------------+
|2020-11-01 03:36:...|Yesterday in a pu...|         8|          95|      389|         5|Manickam Tagore ....|
|2020-11-01 10:54:...|Has this been rep...|         0|           0|        1|         0|        ClaireDeLune|
|2020-11-01 12:10:...|Such a shame!\n\n...|         0|           0|        2|         0|          Rajay Deep|
|2020-11-01 23:55:...|@WhiteHouse @real...|         1|           0|        3|         0|          OnTheFritz|
|2020-11-02 02:28:...|Other side of APM...|         0|           0|        1|         0|Fateh Singh  Bhullar|
|2020-11-02 06:59:...|Given #FarmersPro...|         0|           0|        0|         0|IndoAsianCommodities|
|2020-11-0

                                                                                

In [9]:
# Checking the shape of the DataFrame
print('Number of rows:', df_spark.count())
print('Number of columns:', len(df_spark.columns))



Number of rows: 1082432
Number of columns: 7


                                                                                

In [10]:
# Checking column dtypes.
df_spark.printSchema()

root
 |-- date: string (nullable = true)
 |-- renderedContent: string (nullable = true)
 |-- replyCount: string (nullable = true)
 |-- retweetCount: string (nullable = true)
 |-- likeCount: string (nullable = true)
 |-- quoteCount: string (nullable = true)
 |-- displayname: string (nullable = true)



From the above examination I can see that some columns have to be renamed and their dtypes have to be changed to the proper ones. Furthermore, the column 'renderedContent' needs some cleaning (removing tags, hashtags, emails, and links).

In [None]:
# Renaming the columns.
df_spark = df_spark.select(col('date').alias('date'),
                           col('renderedContent').alias('tweet'),
                           col('displayname').alias('user'),
                           col('replyCount').alias('replied'),
                           col('retweetCount').alias('retweeted'),
                           col('likeCount').alias('liked'),
                           col('quoteCount').alias('quoted'))

In [None]:
# Changing dtypes.
df_spark = df_spark.withColumn('date', to_date(col('date')))
df_spark = df_spark.withColumn('tweet',col('tweet').cast(StringType()))
df_spark = df_spark.withColumn('user',col('user').cast(StringType()))
df_spark = df_spark.withColumn('replied',col('replied').cast('integer'))
df_spark = df_spark.withColumn('retweeted',col('retweeted').cast('integer'))
df_spark = df_spark.withColumn('liked',col('liked').cast('integer'))
df_spark = df_spark.withColumn('quoted',col('quoted').cast('integer'))

In [None]:
# Removing tags, hashtags, emails, and website links from the values of the column 'tweet'.
df_spark = df_spark.withColumn('tweet', regexp_replace('tweet', r'@\w+|#\S+|\S+@\S+|http\S+|www\S+|\S+/\S+', ''))
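To see what this pattern removes, the same alternation can be tried on a single string with Python's `re` module. This is only a rough sketch: Spark evaluates the pattern with Java's regex engine, which is assumed to behave the same way for these simple cases, and the sample strings (including the handle, the e-mail address and the URL) are hypothetical:

```python
import re

# Same alternation as in the Spark cell above.
REMOVE = re.compile(r'@\w+|#\S+|\S+@\S+|http\S+|www\S+|\S+/\S+')

sample = ('Support @some_user #FarmersProtest mail a.b@mail.com '
          'see https://t.co/xyz now')
cleaned = REMOVE.sub('', sample)

# Splitting on whitespace hides the gaps left behind by removed tokens.
print(cleaned.split())

# Side effect: the last alternative removes ANY token containing a slash.
side_effect = REMOVE.sub('', 'protest sites open 24/7')
print(repr(side_effect))
```

Two things are worth noticing: the removed tokens leave extra spaces behind, and the `\S+/\S+` branch also deletes ordinary tokens such as '24/7', which may or may not be desirable for sentiment analysis.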

In [None]:
# Checking the contents of the DataFrame.
df_spark.show(100)

In [None]:
# Checking the contents of the column 'tweet' (formerly 'renderedContent').
df_spark.select('tweet').show(100, False)

The DataFrame column names were changed and the dtype of each column was set correctly. Emails, links, tags, and hashtags were also successfully removed from the values of the column 'tweet'. However, from the above cell I can see that the column 'tweet' still needs some cleaning. Next, I will remove leading and trailing whitespace, replace two or more consecutive whitespace characters with a single space, replace new lines and tabs with '.', and remove non-English text from the string values of the column. Lastly, I'll lowercase the entire string in each value.

In [None]:
# Removing leading and trailing whitespace from the values of the column 'tweet'.
df_spark = df_spark.withColumn('tweet', trim(df_spark.tweet))

# Replacing two or more consecutive whitespace characters with a single space in the column 'tweet'.
df_spark = df_spark.withColumn('tweet', regexp_replace('tweet', r'\s{2,}', ' '))

# Replacing each remaining new-line and tab with '.' in the column 'tweet'.
df_spark = df_spark.withColumn('tweet', regexp_replace('tweet', r'\n|\t', '.'))

# Removing non-English text (any character outside ASCII letters, digits, punctuation, and space).
# The raw string passes the same pattern to Spark without Python's invalid-escape warnings.
df_spark = df_spark.withColumn('tweet', regexp_replace('tweet', r'''[^a-zA-Z0-9!@#$%^&*()_+\-={}\[\]|\;:'",.<>/?~` ]''', ''))

# Lowercasing all the values of the column 'tweet'.
df_spark = df_spark.withColumn('tweet', lower(df_spark.tweet))
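The order of these steps affects the result. The pure-Python sketch below mirrors them on one string; it assumes Python's `re` behaves like Spark's Java regex for these patterns, and `clean_tweet` and the sample text are hypothetical:

```python
import re

# Character class copied from the Spark cell: anything outside it is removed.
NOT_ALLOWED = re.compile(r'''[^a-zA-Z0-9!@#$%^&*()_+\-={}\[\]|\;:'",.<>/?~` ]''')

def clean_tweet(text):
    """Hypothetical helper that applies the Spark steps in the same order."""
    text = text.strip(' ')                # Spark's trim() strips spaces only
    text = re.sub(r'\s{2,}', ' ', text)   # runs of whitespace -> one space
    text = re.sub(r'\n|\t', '.', text)    # remaining single newline/tab -> '.'
    text = NOT_ALLOWED.sub('', text)      # drop emojis and non-ASCII scripts
    return text.lower()

# '\U0001F33E' is an emoji; the text itself is made up for illustration.
sample = '  Such a shame!\n\nJai Kisan \U0001F33E\tfarmers  '
print(repr(clean_tweet(sample)))
```

In this sample the double line break becomes a space rather than '.', because the whitespace-collapsing step runs before the new-line replacement. Running the trim and the whitespace collapse as the last steps would also catch any gaps left by removed characters.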

In [None]:
df_spark.select('tweet').show(100, False)

In [11]:
df_spark = df_spark.orderBy('date')

In [None]:
df_spark.groupBy('user').count().sort('count', ascending=False).show()

In [None]:
# Inspecting the tweet text of the second row without collecting the whole DataFrame to the driver.
df_spark.take(2)[1][1]

In [None]:
print(type(df_spark.count()))

In [None]:
df_spark.createOrReplaceTempView('tweet')  # registering the view that the query below reads
spark.sql('select * from tweet').show(4)

In [None]:
spark.stop()