In [1]:
import pandas as pd

In [2]:
df = pd.read_parquet('datasets/btc_tweets_train.parquet.gzip')

In [3]:
df = df.reset_index()

In [4]:
df.head()

Unnamed: 0,tweet ID,hashtags,content,username,user_displayname,sentiment
0,1641579121972236290,"[Bitcoin, Bitcoin, BTC, Bitcoin, BTC, SHIB, HO...","$Bitcoin TO $100,000 SOONER THAN YOU THINK‼️💯🙏...",BezosCrypto,SHIB Bezos,True
1,1641579176171016194,"[Bitcoin, bitcoinordinals, crypto]",Alright I have my rares. Who else is grabbing ...,spartantc81,SpartanTC,True
2,1641579486071390208,"[BTC, SHIB, HOGE, SAITAMA, BNB, DOGE, ETH, Bab...","Bitcoin (BTC) Targets Over $100,000 as This Im...",BezosCrypto,SHIB Bezos,True
3,1641579537103302656,[BTC],📢 Xverse Web-based pool is live:\n\n•Update @x...,godfred_xcuz,Algorithm.btc,True
4,1641579588399804418,[Bitcoin],"Yesterday, a Bitcoin projection was displayed ...",goddess81oo,she is lucky,True


In [5]:
df.shape

(1500, 6)

### 1. Username Analysis

In [5]:
# Check for usernames with a high number of digits
df['num_digits_in_username'] = df['username'].apply(lambda x: sum(c.isdigit() for c in x))
high_digit_usernames = df[df['num_digits_in_username'] > 5]  # Arbitrary threshold, can be adjusted

# Check for repetitive characters in usernames (the same character three or more times in a row)
df['repetitive_username'] = df['username'].apply(lambda x: any(char * 3 in x for char in set(x)))
repetitive_usernames = df[df['repetitive_username']]

# Combine both checks
suspicious_usernames = df[(df['num_digits_in_username'] > 5) | (df['repetitive_username'])]

print(suspicious_usernames[['username', 'user_displayname', 'content']])

             username                user_displayname  \
170    AboMost0568888         TIGER💥 momen RRR ✍✍ 💥💎💎   
174   meshari79441510                  Shibarium News   
197   ryansmi65393560                    k:DeFiRy⛓️🕸️   
435          Bennnoss                         Ben | 🪢   
451   nergizy32132110                       BABY LUNC   
457          m30_2022                           M_30✨   
478   Abhinav18755615                         Abhinav   
508    AboMost0568888         TIGER💥 momen RRR ✍✍ 💥💎💎   
519          vasssek_                       vasek.eth   
587         baatar666                          baatar   
631   michael000Best1       Mike Trollcoin Ambassador   
648     SaqlainAli888                     Saqlain Ali   
667         888Velvet                       MAXi // 📈   
670       ama_serwaaa                      Ama Judith   
706       MoCrypto777                     MoCrypto777   
726   CryptoAnu530154                 🪬 🌐𝕮𝖗𝖞𝖕𝖙𝖔.𝕬𝖓𝖚🌐🪬   
751    AboMost0568888         T
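
The output above lists the same account several times (for example `AboMost0568888`), because the checks run per tweet. A minimal sketch of the same two heuristics applied per account is shown below. The helper name `username_flags` and its thresholds are illustrative assumptions, not tuned values.

```python
import re
from collections import Counter

# Hypothetical helper; max_digits and min_run are illustrative assumptions.
def username_flags(name, max_digits=5, min_run=3):
    """Return the two heuristic flags for a single username."""
    num_digits = sum(ch.isdigit() for ch in name)
    # (.)\1{2,} matches any character repeated min_run or more times in a row
    has_run = re.search(r"(.)\1{%d,}" % (min_run - 1), name) is not None
    return {"many_digits": num_digits > max_digits, "repeated_char": has_run}

# Toy list standing in for df['username'] (one entry per tweet)
usernames = ["AboMost0568888", "AboMost0568888", "Bennnoss", "spartantc81"]

# Count tweets per flagged account instead of listing every tweet
tweets_per_suspicious_account = Counter(
    name for name in usernames if any(username_flags(name).values())
)
print(tweets_per_suspicious_account)
```

On the real data the helper could be applied with `df['username'].map(username_flags)`. Names such as `Bennnoss` show that the repeated-character rule also catches ordinary handles, so the flags are better treated as candidates for review than as proof of bot activity.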

### 2. Hashtag Analysis (This is still not generating satisfactory results)

In [None]:
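# Count the number of hashtags in each tweet (tokens that start with '#').
# Note: len(x.split()) would count every word in the tweet, not just hashtags.
df['num_hashtags'] = df['content'].apply(lambda x: sum(tok.startswith('#') for tok in x.split()) if isinstance(x, str) else 0)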

# Flag tweets with more than a typical number of hashtags (e.g., >3)
suspicious_hashtags = df[df['num_hashtags'] > 3]

print(suspicious_hashtags[['username', 'num_hashtags', 'hashtags', 'content']])
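
The dataset already has a `hashtags` column, and the `df.head()` output shows that it can contain repeated tags (for example `Bitcoin` several times in one tweet). A minimal sketch of counting distinct hashtags from that column follows. It assumes each cell holds a list-like of strings; the helper name `count_unique_hashtags` is hypothetical.

```python
def count_unique_hashtags(tags):
    """Number of distinct hashtags, ignoring case; missing values count as 0."""
    if isinstance(tags, str):
        # A bare string is not a list of tags, so do not iterate its characters
        return 0
    try:
        return len({str(tag).lower() for tag in tags})
    except TypeError:
        # Non-iterable input such as None or NaN
        return 0

# Toy values standing in for cells of df['hashtags']
print(count_unique_hashtags(["Bitcoin", "Bitcoin", "BTC", "Bitcoin", "BTC", "SHIB"]))
print(count_unique_hashtags(["BTC"]))
print(count_unique_hashtags(None))
```

Under that assumption, `df['hashtags'].apply(count_unique_hashtags)` would give a per-tweet count that does not depend on how the tweet text is tokenized.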

### 3. Content Similarity Analysis

In [8]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Vectorize the tweet content
tfidf = TfidfVectorizer(stop_words='english')
tfidf_matrix = tfidf.fit_transform(df['content'])

# Calculate cosine similarity matrix
cosine_sim = cosine_similarity(tfidf_matrix, tfidf_matrix)

# Set a threshold for similarity (e.g., >0.8) to find suspiciously similar tweets
suspicious_pairs = []
for i in range(len(cosine_sim)):
    for j in range(i+1, len(cosine_sim)):
        if cosine_sim[i, j] > 0.80:
            suspicious_pairs.append((i, j))

# Get the suspicious tweets based on content similarity
for i, j in suspicious_pairs:
    print(f"Tweet {i}: {df.iloc[i]['content']}")
    print(f"Tweet {j}: {df.iloc[j]['content']}")
    print("-------")


Tweet 9: IBIS SACRED, new solana project. metaverse community
LINK : https://t.co/LKBUnJkVIW

Be part of Ibis Sacred Community 

@hey_wallet
  send 13333 $FRONK to the first 500 retweets and follows 

#Solana #bitcoin #FRONK #BONK #CRYPTO #ETH #BSC #BNB #metaverse https://t.co/MOC2QOhfWU
Tweet 110: IBIS SACRED, new solana project. metaverse community
LINK : https://t.co/V0Otvt4e2D

Be part of Ibis Sacred Community 

@hey_wallet
  send 13333 $FRONK to the first 500 retweets and follows 

#Solana #bitcoin #FRONK #BONK #CRYPTO #ETH #BSC #BNB #metaverse https://t.co/uMTEryVKrN
-------
Tweet 13: Free #NFT Giveaway🎁

🚨Drop your $ETH address wallet.

🚨Follow me and Retweet.
#nfts #nft #nftart #nftcommunity #nftcollector #nftartist #digitalart #crypto #cryptoart #art #ethereum #opensea #nftcollectors #blockchain #nftdrop #nftcollectibles #cryptocurrency #Bitcoin https://t.co/ZCFUoUQH1d
Tweet 117: Free #NFT Giveaway🎁

🚨Drop your $ETH address wallet.

🚨Follow me and Retweet.

#nft #nftart #nfts 

Skip the code in the cell above. Below is the same logic written as a function, which will later go into a class.

In [7]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def remove_similar_content(df, content_column, similarity_threshold=0.8):
    """
    Removes samples with similar content from a DataFrame based on TF-IDF cosine similarity.
    Every member of a similar pair is removed; no copy of the near-duplicate content is kept.

    Parameters:
    - df: pd.DataFrame
        The DataFrame containing the dataset.
    - content_column: str
        The name of the column containing the text content to analyze.
    - similarity_threshold: float (default=0.8)
        The threshold for cosine similarity above which samples are considered similar.

    Returns:
    - pd.DataFrame
        The DataFrame with similar content removed.
    """
    # Vectorize the content column
    tfidf = TfidfVectorizer(stop_words='english')
    tfidf_matrix = tfidf.fit_transform(df[content_column])

    # Calculate cosine similarity matrix
    cosine_sim = cosine_similarity(tfidf_matrix, tfidf_matrix)

    # Set to keep track of indices to remove
    indices_to_remove = set()

    # Loop through the similarity matrix to find similar content
    for i in range(len(cosine_sim)):
        for j in range(i + 1, len(cosine_sim)):
            if cosine_sim[i, j] > similarity_threshold:
                # Mark both indices for removal
                indices_to_remove.add(i)
                indices_to_remove.add(j)

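    # i and j are row positions, so map them to index labels before dropping;
    # this also works when df does not have a default RangeIndex
    labels_to_remove = df.index[sorted(indices_to_remove)]
    df_cleaned = df.drop(index=labels_to_remove).reset_index(drop=True)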

    return df_cleaned

# Example usage:
# Assuming your data is in a DataFrame called `df` with a 'content' column
# df_cleaned = remove_similar_content(df, content_column='content')
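
The nested Python loops over the similarity matrix in `remove_similar_content` can be replaced by a NumPy operation on the upper triangle. The sketch below is one way to do that, not the notebook's own method. It also shows a variant that keeps the earliest tweet of each similar pair instead of removing both. The function names and the `keep_first` parameter are hypothetical, and a hand-made matrix stands in for `cosine_similarity(tfidf_matrix)`.

```python
import numpy as np

def similar_pairs(sim, threshold=0.8):
    """Return (i, j) position pairs with i < j whose similarity exceeds threshold.

    Assumes threshold >= 0, because the masked-out entries are set to 0.
    """
    sim = np.asarray(sim)
    # k=1 keeps only the strict upper triangle, so each pair appears once
    # and the diagonal (self-similarity of 1.0) is ignored
    upper = np.triu(sim, k=1)
    return [tuple(int(x) for x in pair) for pair in np.argwhere(upper > threshold)]

def positions_to_drop(pairs, keep_first=False):
    """Positions to remove; with keep_first=True only the later row of each pair is dropped."""
    if keep_first:
        return {j for _, j in pairs}
    return {p for pair in pairs for p in pair}

# Hand-made symmetric matrix standing in for a real cosine similarity matrix
toy_sim = np.array([
    [1.0, 0.9, 0.1],
    [0.9, 1.0, 0.2],
    [0.1, 0.2, 1.0],
])
pairs = similar_pairs(toy_sim, threshold=0.8)
print(pairs)
print(positions_to_drop(pairs))
print(positions_to_drop(pairs, keep_first=True))
```

Removing both rows suits bot filtering, where every copy of a template is unwanted. Keeping the first row suits plain deduplication, where one example of the text should survive.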


In [17]:
df = remove_similar_content(df, 'content', 0.65)

In [18]:
df.shape

(1408, 6)

In [19]:
df['sentiment'].value_counts()

sentiment
True     1134
False     274
Name: count, dtype: int64

### 4. Repetitive Phrases (N-gram Analysis)

Detect n-grams that are overly common in tweets.

In [None]:
from collections import Counter
from nltk import ngrams
import nltk

# Make sure to download the necessary NLTK tokenizer data
# nltk.download('punkt')  # newer NLTK releases may require 'punkt_tab' instead

# Function to get n-grams
def get_ngrams(text, n=3):
    words = nltk.word_tokenize(text)
    return list(ngrams(words, n))

# Apply n-gram extraction
df['ngrams'] = df['content'].apply(lambda x: get_ngrams(x, n=5))

# Flatten the list of n-grams and count frequencies
ngram_list = [ngram for sublist in df['ngrams'] for ngram in sublist]
ngram_freq = Counter(ngram_list)

# Find common n-grams (e.g., those that appear more than 5 times)
common_ngrams = {ngram: count for ngram, count in ngram_freq.items() if count > 5}
print(common_ngrams)
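
Raw n-gram counts can be inflated by a single tweet that repeats a phrase many times. A minimal sketch of counting in how many distinct tweets each n-gram appears is shown below. It uses plain whitespace tokenization instead of `nltk.word_tokenize` so that it runs without extra downloads; that choice and the helper name `ngram_document_frequency` are assumptions made for illustration.

```python
from collections import Counter

def ngram_document_frequency(texts, n=3):
    """Count, for each word n-gram, the number of texts that contain it at least once."""
    doc_freq = Counter()
    for text in texts:
        words = text.lower().split()  # simple whitespace tokenization (an assumption)
        grams = {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}
        doc_freq.update(grams)  # a set, so each text counts at most once per n-gram
    return doc_freq

# Toy tweets standing in for df['content']
toy_tweets = [
    "Follow me and retweet to win",
    "follow me and retweet now",
    "Bitcoin is up today",
]
freq = ngram_document_frequency(toy_tweets, n=3)
shared = {gram: count for gram, count in freq.items() if count > 1}
print(shared)
```

An n-gram that appears in many different tweets is a stronger hint of templated content than one that is frequent only because a single tweet repeats it.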


The methods for detecting and removing bot tweets should generally be applied **before** text processing (like removing stop words, lemmatization, punctuation removal, etc.). Here's why:

### 1. **Bot Detection Is More Reliable on Raw Text**
   - **Username and Content Similarity:** Bots often generate templated or repetitive content. Cleaning the text first (removing stop words, punctuation, URLs, or emoji) can strip the very tokens that mark a template, such as a fixed run of hashtags or a repeated call to action. Running the similarity check on the raw text keeps those signals intact. Note that `TfidfVectorizer(stop_words='english')` already lowercases the text and drops punctuation and English stop words internally, so no separate cleaning pass is needed for that check.
   - **N-gram Analysis:** Preprocessing changes the tokens that n-grams are built from. Lemmatization and stop-word removal merge or drop words, so the exact wording of a template can no longer be read off the n-gram counts, and frequent n-grams become harder to interpret as evidence of copy-pasted text.

### 2. **Cleaner Data for Sentiment Analysis**
   - **After Bot Removal:** Once you've identified and removed potential bot tweets, you can then proceed with text processing like stop word removal, lemmatization, and punctuation removal. This ensures that the remaining text is cleaned and ready for more accurate sentiment analysis.

### Workflow Suggestion:

1. **Raw Data:**
   - **Bot Detection:** Apply the username analysis, content similarity analysis, and possibly n-gram analysis on the raw text to flag and remove suspicious tweets.
   
2. **Text Processing:**
   - **Cleaning:** After bot removal, proceed with standard text processing steps such as:
     - Lowercasing
     - Removing stop words
     - Lemmatization or stemming
     - Removing punctuation
     - Tokenization

3. **Sentiment Analysis:**
   - **Model Training:** Use the cleaned dataset, now with reduced noise from potential bot-generated content, to train your sentiment analysis model.

By following this approach, you ensure that bot detection is based on the full, unaltered text, making it more effective. After bot removal, your sentiment analysis can proceed with clean, human-generated content, leading to more reliable results.
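
The ordering described above can be sketched as a small pipeline: filter on the raw text first, then clean what remains. Everything in the sketch is illustrative. The stop-word list is deliberately tiny, lemmatization is left out, and `too_many_hashtags` is a hypothetical rule standing in for the username, similarity, and n-gram checks.

```python
import string

# Tiny illustrative stop-word list; a real run would use a fuller list
STOP_WORDS = {"a", "an", "the", "is", "to", "and", "of"}

def clean_text(text):
    """Lowercase, strip ASCII punctuation, tokenize on whitespace, drop stop words."""
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return [word for word in text.split() if word not in STOP_WORDS]

def run_pipeline(tweets, is_suspicious):
    """Filter on the raw text first, then clean only the tweets that remain."""
    kept = [tweet for tweet in tweets if not is_suspicious(tweet)]
    return [clean_text(tweet) for tweet in kept]

# Hypothetical rule standing in for the bot-detection checks
def too_many_hashtags(tweet, limit=3):
    return sum(token.startswith("#") for token in tweet.split()) > limit

tweets = [
    "Free #NFT giveaway! #nft #crypto #eth #btc",
    "Bitcoin is holding above the weekly support.",
]
print(run_pipeline(tweets, too_many_hashtags))
```

The hashtag rule only works on the raw text: `clean_text` strips the `#` characters, so running it first would leave nothing for the rule to count.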