In [7]:
!pip install nltk
!pip install spacy



Install nltk and spacy libraries

#Search V1.0

##Offline Processing

In [8]:
import re

def load_tweets(filename):
    with open(filename, 'r') as file:
        tweets = file.readlines()
    return tweets

# Load tweets from file
tweets = load_tweets('australian_election_2019_tweets.txt')
print("Number of tweets before cleaning:", len(tweets))

def remove_duplicates(tweets):
    return list(set(tweets))

# Remove duplicates
tweets = remove_duplicates(tweets)

def remove_urls_mentions_hashtags_nonenglish(tweets):
    cleaned_tweets = []
    for tweet in tweets:
        # Remove URLs
        tweet = re.sub(r'http\S+', '', tweet)
        # Remove mentions
        tweet = re.sub(r'@\w+', '', tweet)
        # Remove hashtags
        tweet = re.sub(r'#\w+', '', tweet)
        # Remove non-ASCII characters (a rough proxy for non-English text)
        tweet = re.sub(r'[^\x00-\x7F]+', '', tweet)
        cleaned_tweets.append(tweet.strip())
    return cleaned_tweets

# Remove URLs, mentions, hashtags, and non-English text
cleaned_tweets = remove_urls_mentions_hashtags_nonenglish(tweets)

print("Number of tweets after cleaning:", len(cleaned_tweets))

Number of tweets before cleaning: 347192
Number of tweets after cleaning: 264734


After loading the tweets into memory, I remove exact duplicate lines and then clean each remaining tweet with regular expressions, stripping URLs, mentions, hashtags, and non-ASCII characters (a rough proxy for non-English text). The drop from 347192 to 264734 tweets comes entirely from duplicate removal: the cleaning step rewrites each tweet and never discards one, so tweets that become empty or identical only after cleaning are still counted. The cleaned tweets are the input to the search steps that follow.
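As a possible refinement of the cleaning step described above, the sketch below cleans first and deduplicates afterwards, so that tweets differing only in their URL or mentions collapse into one entry. The function names `clean_tweet` and `clean_and_deduplicate` and the sample tweets are my own illustrations, not part of the notebook. It also uses `dict.fromkeys` instead of `set`, which keeps the tweets in first-seen order and makes repeated runs reproducible.

```python
import re

def clean_tweet(tweet):
    """Strip URLs, mentions, hashtags and non-ASCII characters from one tweet."""
    tweet = re.sub(r'http\S+', '', tweet)
    tweet = re.sub(r'@\w+', '', tweet)
    tweet = re.sub(r'#\w+', '', tweet)
    tweet = re.sub(r'[^\x00-\x7F]+', '', tweet)
    return tweet.strip()

def clean_and_deduplicate(tweets):
    """Clean every tweet, then drop empties and duplicates, keeping first-seen order."""
    cleaned = (clean_tweet(t) for t in tweets)
    # dict.fromkeys preserves insertion order, unlike set()
    return list(dict.fromkeys(t for t in cleaned if t))

# Made-up sample lines, only for illustration
sample = [
    "Vote today! http://example.com/a\n",
    "Vote today! http://example.com/b\n",
    "@someone #auspol\n",
    "Energy policy matters\n",
]
print(clean_and_deduplicate(sample))
```

With this ordering, the two "Vote today!" lines count as one tweet and the line that contained only a mention and a hashtag is dropped.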

##Real Time Usage

In [9]:
import random

def generate_hashtags():
    single_word_hashtags = ['#energy', '#laws', '#parliament', '#coalition', '#change']
    multiword_hashtags = ['#newPolicy', '#Election2019', '#moreSkill', '#CyberSecurity', '#violenceLaws']

    hashtags = single_word_hashtags + multiword_hashtags
    hashtags = [hashtag.replace('#', '') for hashtag in hashtags]

    return hashtags

def string_match_search(tweets, hashtags):
    tweet_scores = {hashtag: [] for hashtag in hashtags}

    for tweet in tweets:
        for hashtag in hashtags:
            if hashtag.lower() in tweet.lower():
                score = sum(1 for word in tweet.split() if word.lower() == hashtag.lower())
                tweet_scores[hashtag].append((tweet, score))

    top_tweets = {}
    for hashtag, scores in tweet_scores.items():
        sorted_tweets = sorted(scores, key=lambda x: x[1], reverse=True)[:5]
        top_tweets[hashtag] = [tweet for tweet, _ in sorted_tweets]

    return top_tweets

# Generate hashtags
hashtags = generate_hashtags()

# Perform string match-based search
top_tweets = string_match_search(cleaned_tweets, hashtags)

# Display top 5 tweets for each hashtag
for hashtag, tweets in top_tweets.items():
    print(f'Top 5 tweets for {hashtag}:')
    for i, tweet in enumerate(tweets, 1):
        print(f'{i}. {tweet}')
    print()

Top 5 tweets for energy:
1. CO2 is NOT a pollutant!  It is an essential nutrient for plants; in the process of photosynthesis, plants convert radiant energy from sun into chemical energy in the form of glucose or sugar:  6 H2O + 6CO2 + radiant energy (sunlight) --&gt; C6H12O6 (glucose) + 6O2 (oxygen)
2. As we head into The Climate election ponder the possible future of free energy and Australia as a renewable superpower. Its on its way. The radical path to free energy for all - Financial Review, 5/17/2019
3. Yet we havent seen one media outlet cover Michael Wests investigation of the incestuous relationships b/w LNP &amp; the Coal Industry    Liberal partys rank opportunism spells danger for Australian energy policy | Energy | The Guardian
4. Barnaby Joyce spruiking energy policy is important on TV with cheering constituents behind him and his party doesn't have an energy policy.
5. Low Energy Ollie already coming up with excuses for a low energy result tonight. Sad!

Top 5 tweets for 

The function generate_hashtags() builds the list of search terms by combining five single-word and five multiword hashtags and removing the '#' symbol. The string_match_search(tweets, hashtags) function then finds relevant tweets for each hashtag. A tweet becomes a candidate when the lowercased hashtag appears anywhere in the lowercased tweet, and its score is the number of whitespace-separated words that equal the hashtag exactly (ignoring case). A word with attached punctuation, such as "energy.", therefore passes the substring test but adds nothing to the score. The five highest-scoring tweets for each hashtag are stored in a dictionary. Finally, I generate the hashtags, perform the search, and display the top 5 tweets for each hashtag. Because the cleaning step removed hashtags from the tweets, a multiword term such as newPolicy can only match when its joined form appears in ordinary text.
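One way the scoring described above could be made less sensitive to punctuation is to count whole-word matches with a regular expression instead of comparing whitespace-separated words. The following is only a sketch under that assumption; `count_term` and `top_matches` are hypothetical names and the sample tweets are made up.

```python
import heapq
import re

def count_term(tweet, term):
    """Count whole-word, case-insensitive occurrences of term in tweet."""
    pattern = r'\b' + re.escape(term) + r'\b'
    return len(re.findall(pattern, tweet, flags=re.IGNORECASE))

def top_matches(tweets, term, n=5):
    """Return up to n tweets with the most whole-word occurrences of term."""
    scored = [(count_term(t, term), t) for t in tweets]
    # Keep only tweets that actually contain the term as a word
    scored = [(s, t) for s, t in scored if s > 0]
    best = heapq.nlargest(n, scored, key=lambda pair: pair[0])
    return [t for _, t in best]

# Made-up sample tweets, only for illustration
sample = [
    "Energy, energy and more energy.",
    "No mention here",
    "An energy-policy-free party",
    "Bioenergy is a different token",
]
print(top_matches(sample, "energy", n=2))
```

Here "Energy," and "energy." both count, while "Bioenergy" does not, because `\b` requires a word boundary on each side of the term.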

# Search V2.0

##Offline Processing

In [10]:
import re

def load_tweets(filename):
    with open(filename, 'r') as file:
        tweets = file.readlines()
    return tweets

# Load tweets from file
tweets = load_tweets('australian_election_2019_tweets.txt')
print("Number of tweets before cleaning:", len(tweets))

def remove_duplicates(tweets):
    return list(set(tweets))

# Remove duplicates
tweets = remove_duplicates(tweets)

def remove_urls_mentions_hashtags_nonenglish(tweets):
    cleaned_tweets = []
    for tweet in tweets:
        # Remove URLs
        tweet = re.sub(r'http\S+', '', tweet)
        # Remove mentions
        tweet = re.sub(r'@\w+', '', tweet)
        # Remove hashtags
        tweet = re.sub(r'#\w+', '', tweet)
        # Remove non-ASCII characters (a rough proxy for non-English text)
        tweet = re.sub(r'[^\x00-\x7F]+', '', tweet)
        cleaned_tweets.append(tweet.strip())
    return cleaned_tweets

# Remove URLs, mentions, hashtags, and non-English text
cleaned_tweets = remove_urls_mentions_hashtags_nonenglish(tweets)

print("Number of tweets after cleaning:", len(cleaned_tweets))

Number of tweets before cleaning: 347192
Number of tweets after cleaning: 264734


This is the same code as used in Search V1.0.

##Preprocessing Techniques

In [11]:
import re
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

# Build the stemmer and the stopword set once instead of once per tweet
ps = PorterStemmer()
stop_words = set(stopwords.words('english'))

def preprocess_tweet(tweet):
    # Lower casing
    tweet = tweet.lower()
    # Tokenization
    words = word_tokenize(tweet)
    # Stemming
    stemmed_words = [ps.stem(word) for word in words]
    # Stopword removal (applied to the stemmed tokens)
    filtered_words = [word for word in stemmed_words if word not in stop_words]
    return ' '.join(filtered_words)

def preprocess_tweets(tweets):
    preprocessed_tweets = [preprocess_tweet(tweet) for tweet in tweets]
    return preprocessed_tweets

# Apply text pre-processing techniques
preprocessed_tweets = preprocess_tweets(cleaned_tweets)

print("Number of tweets after cleaning and preprocessing:", len(preprocessed_tweets))

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


Number of tweets after cleaning and preprocessing: 264734


This cell preprocesses the cleaned tweets with NLTK (the re import is not used here). The preprocess_tweet(tweet) function lowercases a tweet, tokenizes it with word_tokenize, stems each token with the Porter stemmer, and then removes English stopwords, while preprocess_tweets(tweets) applies the same steps to a list of tweets. Because stemming runs before stopword removal, a stopword whose stem differs from its original form is not filtered out. The count printed at the end is unchanged at 264734, since preprocessing rewrites each tweet rather than removing any.
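The order of the steps matters. The sketch below shows an alternative ordering, with stopwords removed before stemming, so that the stopword list is compared against unstemmed words. It is not the notebook's method: the `preprocess` function, the regex tokenizer, `toy_stem`, and `toy_stop_words` are hypothetical stand-ins chosen so the example runs without NLTK. In the notebook, the stemmer and stopword arguments would presumably be `PorterStemmer().stem` and `set(stopwords.words('english'))`.

```python
import re

def preprocess(text, stem, stop_words):
    """Lowercase, tokenize, drop stopwords, then stem the remaining tokens."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    kept = [tok for tok in tokens if tok not in stop_words]
    return [stem(tok) for tok in kept]

# Toy stand-ins so the sketch runs without NLTK; both are hypothetical.
def toy_stem(token):
    """Crude stemmer: drop a trailing 's' from words longer than three letters."""
    return token[:-1] if token.endswith('s') and len(token) > 3 else token

toy_stop_words = {'this', 'was', 'the', 'a'}

print(preprocess("This was the energy laws debate", toy_stem, toy_stop_words))
```

With the toy stemmer, "this" would become "thi" if stemming ran first, and "thi" is not in the stopword set. Filtering before stemming avoids that kind of mismatch.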

##Real Time Usage

In [12]:
def string_match_search_v2(tweets, hashtags):
    tweet_scores = {hashtag: [] for hashtag in hashtags}

    for tweet in tweets:
        preprocessed_tweet = preprocess_tweet(tweet)
        for hashtag in hashtags:
            if hashtag.lower() in preprocessed_tweet:
                score = sum(1 for word in preprocessed_tweet.split() if word.lower() == hashtag.lower())
                tweet_scores[hashtag].append((tweet, score))

    top_tweets = {}
    for hashtag, scores in tweet_scores.items():
        sorted_tweets = sorted(scores, key=lambda x: x[1], reverse=True)[:5]
        top_tweets[hashtag] = [tweet for tweet, _ in sorted_tweets]

    return top_tweets

# Perform search for the same set of hashtags
top_tweets_v2 = string_match_search_v2(cleaned_tweets, hashtags)

# Display top 5 tweets for each hashtag after preprocessing
for hashtag, tweets in top_tweets_v2.items():
    print(f'Top 5 tweets for {hashtag}:')
    for i, tweet in enumerate(tweets, 1):
        print(f'{i}. {tweet}')
    print()

Top 5 tweets for energy:
1. Wht ws it if not FASCISM/TREASN tht NEO-COMMUNIST GRNLAB/TBULL/LEFT 1st weaknd COAL-our only reliab energy-by scaring off ALL INVESTRS &amp; BANKS W/WIDE-made it a PARIAH-STOLE GOVT by COUP 2 enslave COAL 2 th LW COUPGOVT/BIZ-favord uselss INTRMITTNTS SO 'RE' CAN DISABLE A'a
2. ETEnergyworld | COLUMN-Australia's shock election shows killing coal mining is no sure thing: Russell
3. Australia cant afford the climate-denying, energy-policy-free, unstable LNP.  Vote in thinking candidates unshackled by the orthodoxies of the major parties.
4. Wht is it if not FASCISM/TREASN tht NEO-COMMUNIST GRNLAB/TBULL/LEFT 1st weakns COAL-our only reliab energy-by scaring off ALL INVESTRS &amp; BANKS W/WIDE-makes it a PARIAH-STEALS GOVT by COUP 2 nslave COAL 2 th LW COUPGOVT/BIZ-favord uselss INTRMITTNTS SO 'RE' CN DISABLE A'a
5. Labor is ready to act on climate change by reducing Australias pollution by 45 per cent on 2005 levels by 2030 and reaching net zero pollution by 20

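This is the search function from Search V1.0, adapted for V2.0: each tweet is passed through preprocess_tweet() before matching, while the original cleaned tweet is what gets stored and displayed. The hashtags themselves are only lowercased, not stemmed, so they are compared against stemmed tweet tokens.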
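A search over preprocessed text generally works best when the query terms go through the same preprocessing as the documents. The sketch below illustrates that idea; it is an assumption about how the V2.0 search could be arranged, not the code that produced the output above. `search_normalized` and `toy_normalize` are hypothetical names, and `toy_normalize` stands in for a tokenize-and-stem function such as the notebook's preprocess_tweet().

```python
from collections import Counter

def search_normalized(tweets, terms, normalize, top_n=5):
    """Rank tweets per term, applying the same normalize() to tweets and terms.

    normalize maps a string to a list of normalized tokens.
    """
    # Normalize every tweet once (the offline step), keeping the original text
    indexed = [(tweet, Counter(normalize(tweet))) for tweet in tweets]
    results = {}
    for term in terms:
        term_tokens = set(normalize(term))
        scored = []
        for tweet, counts in indexed:
            score = sum(counts[tok] for tok in term_tokens)
            if score > 0:
                scored.append((score, tweet))
        scored.sort(key=lambda pair: pair[0], reverse=True)
        results[term] = [tweet for _, tweet in scored[:top_n]]
    return results

# Toy normalizer standing in for a real tokenizer and stemmer; hypothetical.
def toy_normalize(text):
    words = (w.strip('.,!?').lower().rstrip('s') for w in text.split())
    return [w for w in words if w]

# Made-up sample tweets, only for illustration
sample = ["New laws passed today.", "The law is the law!", "Nothing relevant"]
print(search_normalized(sample, ["laws"], toy_normalize))
```

Because the term "laws" is normalized to the same token as "law" in the tweets, both forms count toward the score, and the tweet with two occurrences ranks first.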

#Search Results and Insights:

##1. Simple Word-Overlap Based Match (Search V1.0):

###Offline Processing:
In this phase, the script loads the tweets from the file, removes duplicate lines, and strips URLs, mentions, hashtags, and non-ASCII characters from each tweet. Removing duplicates reduces redundancy (retweets and repeated lines would otherwise fill the top results), and cleaning leaves only the plain text that the search is matched against.

###Real Time Usage:
Once the tweets are cleaned, the script generates the hashtags and runs a string-match search to extract the top 5 tweets for each one. Relevance is the number of times the hashtag occurs as a word in the tweet. Reading the top tweets gives a quick view of what was being said about each topic during the election, for example the arguments about energy policy.
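The multiword hashtags are a special case for a word-based match, since a joined form such as newPolicy rarely appears in ordinary text. One possible way to handle them, sketched below under my own assumptions, is to split the hashtag at its capital letters and digits and require every resulting word to occur in the tweet. `split_hashtag` and `contains_all_words` are hypothetical helpers, not part of the notebook.

```python
import re

def split_hashtag(tag):
    """Split a camelCase hashtag such as 'newPolicy' into lowercase words."""
    parts = re.findall(r'[A-Z]?[a-z]+|[A-Z]+(?![a-z])|\d+', tag.lstrip('#'))
    return [p.lower() for p in parts]

def contains_all_words(tweet, tag):
    """True if every word of the split hashtag occurs as a whole word in the tweet."""
    words = split_hashtag(tag)
    tweet_words = set(re.findall(r'[a-z0-9]+', tweet.lower()))
    return bool(words) and all(word in tweet_words for word in words)

print(split_hashtag('#CyberSecurity'))
print(contains_all_words("A new energy policy is needed", "newPolicy"))
```

Requiring all words is a strict choice. A looser variant could score a tweet by how many of the words it contains.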


##2. Improving Search Quality with Text-preprocessing (Search V2.0):

###Offline Processing:
In this stage, the script applies standard text preprocessing to the cleaned tweets: lower casing, tokenization, stemming, and stopword removal. Lower casing makes matching case-insensitive, tokenization splits the text into words and punctuation, stemming reduces words to a common stem (which is not always a dictionary word), and stopword removal drops common words that carry little meaning. Together these steps standardize the text and reduce noise before matching.

###Real Time Usage:
Building upon the preprocessed data, the script conducts the same string match-based search as in V1.0, but on the preprocessed tweets. Relevance is still the number of times the hashtag occurs as a word in the tweet. The aim of the V2.0 preprocessing is more accurate and meaningful search results than V1.0: with noise removed and the text standardized, matching should depend less on surface variation such as case or word endings. For that to hold, the hashtags need the same preprocessing as the tweets. Here they are only lowercased, so the benefit is best confirmed by comparing the two result lists rather than assumed.
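A simple way to compare the two versions is to measure how much their top-5 lists overlap for each hashtag. The sketch below computes the Jaccard overlap between two result dictionaries shaped like top_tweets and top_tweets_v2. The function name `result_overlap` and the placeholder data are my own. The overlap only shows how different the results are; judging which list is more relevant would still need a manual check.

```python
def result_overlap(results_a, results_b):
    """Per-term Jaccard overlap between two {term: [tweets]} result dictionaries."""
    overlap = {}
    for term in results_a.keys() & results_b.keys():
        a, b = set(results_a[term]), set(results_b[term])
        union = a | b
        # Two empty result lists are treated as identical
        overlap[term] = len(a & b) / len(union) if union else 1.0
    return overlap

# Placeholder results, only for illustration
v1 = {'energy': ['t1', 't2', 't3'], 'laws': []}
v2 = {'energy': ['t2', 't3', 't4'], 'laws': []}
print(result_overlap(v1, v2))
```

A value of 1.0 means both versions returned the same tweets for that hashtag, and 0.0 means they share none.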







I have used some online resources to help me with my code. These include:

https://stackoverflow.com/questions/64719706/cleaning-twitter-data-pandas-python

https://aronakhmad.medium.com/twitter-data-cleaning-using-python-db1ec2f28f08

https://stackoverflow.com/questions/51717995/extracting-data-tweets-based-on-a-specific-hashtag

https://github.com/shresth26/Twitter-Hashtag-Analysis