# DSC360 - Week 5 - Exercise 5.2

We begin this week's exercises with a section for attribution and description.

We also import any libraries or files the notebook needs. The `warnings` library is used here because running the code produced soft warnings. Suppressing them would not be done in production; it is done here to keep the output readable.

In [1]:
import warnings
warnings.filterwarnings('ignore')
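
A narrower alternative to a global filter is to silence one warning category inside a limited block. The sketch below uses only the standard `warnings` module; `noisy_call` is a hypothetical stand-in for whichever library call raised the soft warnings, and `FutureWarning` is an assumed category, since the notebook does not say which warnings appeared.

```python
import warnings

def noisy_call():
    # Hypothetical stand-in for a library call that emits a soft warning
    warnings.warn("this behaviour will change", FutureWarning)
    return 42

# Suppress only FutureWarning, and only inside this block;
# the previous filter state is restored when the block exits
with warnings.catch_warnings():
    warnings.simplefilter("ignore", category=FutureWarning)
    result = noisy_call()

print(result)
```

Scoping the filter this way keeps unrelated warnings visible in the rest of the notebook.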

***

## Exercises 5.2

Using the file, *twitter_sample.csv*, which can be found in the data directory in the Week 5 GitHub repository:

## Cleaning the "Tweet Content" Column

**1. Clean the "Tweet Content" column by removing non-text data and stop words.**

In [2]:
# Import the necessary libraries
import pandas as pd
import re
import nltk
from nltk.corpus import stopwords

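# Define class for cleaning tweet content
class TweetCleaner:
    def __init__(self):
        # Fetch the stop-word list if it is not already available locally
        nltk.download('stopwords', quiet=True)
        self.stop_words = set(stopwords.words('english'))

    # Define function to clean tweet content
    def clean(self, text):
        # Treat missing or non-string values (e.g., NaN) as empty text
        if not isinstance(text, str):
            return ''
        # Remove URLs, hashtags, and mentions
        text = re.sub(r'http\S+|www\S+|#\S+|@\S+', '', text)
        # Replace non-word characters (emojis, punctuation, symbols) with spaces
        text = re.sub(r'\W', ' ', text)
        # Collapse repeated whitespace
        text = re.sub(r'\s+', ' ', text)
        # Convert to lower case
        text = text.strip().lower()

        # Remove stop words
        text = ' '.join(word for word in text.split() if word not in self.stop_words)
        return text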

# Load data
df = pd.read_csv(r'C:\Users\thefli0\Downloads\twitter_sample.csv')

a. Examining the data in *twitter_sample.csv* showed characters (e.g., emojis and symbols) that should not be included. A class was created to remove this non-text data and the stop words, which keeps the cleaning logic encapsulated in one place.
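
To make the effect of the cleaning steps concrete, here is a minimal standalone sketch that applies the same sequence of substitutions to one made-up tweet. The function name `clean_sample`, the sample text, and the tiny stop-word set are assumptions for illustration only; the notebook itself uses NLTK's full English stop-word list.

```python
import re

# Tiny illustrative stop-word set (assumption; not NLTK's actual list)
SAMPLE_STOP_WORDS = {"the", "is", "a", "at", "this", "out"}

def clean_sample(text, stop_words=SAMPLE_STOP_WORDS):
    # Drop URLs, hashtags, and mentions
    text = re.sub(r'http\S+|www\S+|#\S+|@\S+', '', text)
    # Replace non-word characters (emojis, punctuation) with spaces
    text = re.sub(r'\W', ' ', text)
    # Collapse whitespace, trim, and lowercase
    text = re.sub(r'\s+', ' ', text).strip().lower()
    # Keep only tokens that are not stop words
    return ' '.join(w for w in text.split() if w not in stop_words)

sample = "Check this out 🚀 @nasa: The launch is at 9am! https://example.com #space"
cleaned = clean_sample(sample)
print(cleaned)  # check launch 9am
```

Note that the hashtag pattern removes the whole tag, so a tweet consisting only of hashtags, mentions, and links cleans down to an empty string.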

***

## Building BOW and TF-IDF Vectorizer Representations

**2. Filtering only tweets (not re-tweets) use your class from part one of this exercise to build BOW and TF-IDF Vectorizer representations of the text; print your results. Don't overthink this, leverage what the author does in the text. *NOTE*: the CountVectorizer class in sklearn has changed slightly. You'll see in the text on page 210 that there's a function called get_feature_names(). This is now get_feature_names_out(). The TfidfVectorizer class has this same change.**

In [3]:
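# Filter tweets prior to cleaning; .copy() makes the subset an independent
# DataFrame so the new column below is not assigned to a slice of df
df_filtered = df[df['Tweet Type'] == 'Tweet'].copy()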

# Initialize TweetCleaner class
cleaner = TweetCleaner()

# Apply cleaning function to "Tweet Content" column
df_filtered['Cleaned Tweet Content'] = df_filtered['Tweet Content'].apply(cleaner.clean)

# Filter empty tweets after cleaning
df_filtered = df_filtered[df_filtered['Cleaned Tweet Content'].str.strip() != '']

a. This code cell filters on 'Tweet Type' before cleaning 'Tweet Content', so that only original tweets (not retweets) are processed. When this initial step was not included, no content was returned. With it in place, the data are filtered first and the 'TweetCleaner' class is then applied; any tweets left empty after cleaning are dropped.
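
The filter-then-clean order can be sketched on a toy frame. Everything below is hypothetical except the column names, which follow the notebook: the three rows, the `'ReTweet'` label, and the stand-in cleaner (which only strips hashtags and lowercases) are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical miniature of the dataset; column names follow the notebook
toy = pd.DataFrame({
    "Tweet Type": ["Tweet", "ReTweet", "Tweet"],
    "Tweet Content": ["Hello world", "RT hello again", "#tagonly"],
})

# Keep original tweets only; .copy() gives an independent frame to modify
originals = toy[toy["Tweet Type"] == "Tweet"].copy()

# Stand-in cleaner (assumption): drop hashtags, trim, lowercase
originals["Cleaned Tweet Content"] = (
    originals["Tweet Content"]
    .str.replace(r"#\S+", "", regex=True)
    .str.strip()
    .str.lower()
)

# Drop rows whose text vanished during cleaning
originals = originals[originals["Cleaned Tweet Content"] != ""]
print(originals["Cleaned Tweet Content"].tolist())  # ['hello world']
```

Dropping the empty rows matters for the next step, because an all-empty document contributes a row of zeros to the vectorized matrix.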

In [4]:
# Import the necessary libraries
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Create BOW and TF-IDF Vectorizer representations
bow_vectorizer = CountVectorizer()
tfidf_vectorizer = TfidfVectorizer()

# Check that at least one tweet remains after filtering and cleaning
if df_filtered.empty:
    print("No valid content to vectorize after cleaning.")
else:
    # Fit and transform filtered tweets
    bow_matrix = bow_vectorizer.fit_transform(df_filtered['Cleaned Tweet Content'])
    tfidf_matrix = tfidf_vectorizer.fit_transform(df_filtered['Cleaned Tweet Content'])

    # Convert to DataFrames for inspection
    bow_df = pd.DataFrame(bow_matrix.toarray(), columns=bow_vectorizer.get_feature_names_out())
    tfidf_df = pd.DataFrame(tfidf_matrix.toarray(), columns=tfidf_vectorizer.get_feature_names_out())

    # Print results
    print("Bag of Words (BOW) representation:\n", bow_df)
    print("\nTF-IDF representation:\n", tfidf_df)

Bag of Words (BOW) representation:
     000  000th  0303  10  118  12  2010  2019  21  210  ...  wyprodukowania  \
0     0      0     0   0    0   0     0     0   0    0  ...               0   
1     0      0     1   0    1   0     0     0   0    0  ...               0   
2     0      0     0   0    0   0     0     0   0    0  ...               0   
3     0      0     0   0    0   0     0     0   0    0  ...               0   
4     0      0     0   0    0   0     0     0   0    0  ...               0   
..  ...    ...   ...  ..  ...  ..   ...   ...  ..  ...  ...             ...   
90    0      0     0   0    0   0     0     0   0    1  ...               0   
91    0      0     0   0    0   0     0     0   0    0  ...               0   
92    0      0     0   0    0   0     0     0   0    0  ...               0   
93    0      0     0   0    0   0     0     0   0    0  ...               0   
94    0      0     0   0    0   0     0     0   0    0  ...               0   

    year  years

b. The 'sklearn' library was imported to perform vectorization. An 'if/else' statement was included as a guard in case the filtering and cleaning steps left no tweets to vectorize. Once content was confirmed to be present, the BOW and TF-IDF vectorizers were fit to the cleaned tweets and used to transform them into vectors. The BOW representation is a table in which each row represents a tweet and each column represents a word; each cell holds the number of times that word occurs in that tweet. This shows which words each tweet contains and how much word usage overlaps between tweets. The TF-IDF representation has the same layout, but each cell holds a score that reflects how important the word is to that tweet relative to the entire dataset: words that appear in many tweets are weighted down. Both matrices were converted to pandas DataFrames for inspection and printed.
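
The difference between the two representations can be worked by hand on two made-up documents. The sketch below is plain Python rather than the notebook's vectorizers; it assumes the weighting that, to my understanding, `TfidfVectorizer` applies by default (raw term counts, smoothed IDF of ln((1 + n) / (1 + df)) + 1, then L2 normalization of each row). The documents and variable names are hypothetical.

```python
import math
from collections import Counter

# Two hypothetical cleaned documents
docs = ["data science rocks", "data data pipeline"]
vocab = sorted({w for d in docs for w in d.split()})

# Bag of words: raw count of each vocabulary word per document
bow = [[Counter(d.split())[w] for w in vocab] for d in docs]

# Smoothed inverse document frequency (assumed scikit-learn default)
n = len(docs)
doc_freq = [sum(1 for row in bow if row[j] > 0) for j in range(len(vocab))]
idf = [math.log((1 + n) / (1 + f)) + 1 for f in doc_freq]

# TF-IDF: count times IDF, then scale each row to unit length
tfidf = []
for row in bow:
    weighted = [c * w for c, w in zip(row, idf)]
    norm = math.sqrt(sum(v * v for v in weighted))
    tfidf.append([v / norm for v in weighted])

print(vocab)
print(bow)
print([[round(v, 3) for v in row] for row in tfidf])
```

In the first document every word occurs once, so the BOW row treats them equally, while the TF-IDF row gives "data" a lower score than "science" or "rocks" because "data" also appears in the second document.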

***

## Finding Document Similarity Using Cosine Similarity

**3. Find one or more documents (each tweet is a document) that are similar to each other using Cosine Similarity; print your results. (NOTE: The lower the Cosine Similarity, the more likely the documents are similar.)**

In [5]:
# Import necessary libraries
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

# Calculate Cosine Similarity
cosine_similarities = cosine_similarity(tfidf_matrix)

# Convert to a DataFrame labeled by Tweet Id for readability
similarity_matrix = pd.DataFrame(cosine_similarities, index=df_filtered['Tweet Id'], columns=df_filtered['Tweet Id'])

# Print results
print("Cosine Similarity Matrix:\n", similarity_matrix)

Cosine Similarity Matrix:
 Tweet Id               "1167429261210218497"  "1167375334670557185"  \
Tweet Id                                                              
"1167429261210218497"               1.000000               0.000000   
"1167375334670557185"               0.000000               1.000000   
"1167228285463531520"               0.000000               0.000000   
"1167061075075973123"               0.000000               0.000000   
"1166892836165496835"               0.000000               0.000000   
...                                      ...                    ...   
"1145799585446715394"               0.000000               0.036298   
"1145685620016242688"               0.000000               0.000000   
"1145672954371608576"               0.042939               0.000000   
"1145637718493290497"               0.000000               0.000000   
"1145609117618397184"               0.116204               0.000000   

Tweet Id               "1167228285463531520"  "11

a. For the final step of the exercise, `cosine_similarity` was imported from the 'sklearn' library and used to compute the similarity between every pair of tweets in the TF-IDF matrix. The resulting matrix was converted to a pandas DataFrame, labeled by "Tweet Id" on both axes, for readability. Each cell holds the cosine similarity score for one pair of tweets, and the diagonal is 1 because each tweet is compared with itself. The closer a score is to 1, the more similar the tweets are, while a score closer to 0 indicates dissimilarity. The prompt's note that lower values indicate similarity holds for cosine distance (1 minus cosine similarity), not for the similarity scores computed here.
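
Reading the most similar pair directly off a large matrix is difficult, so one option is to search it programmatically. The sketch below runs on a small hypothetical matrix (`sims`, with made-up scores) rather than the notebook's results; the same steps should apply to the array returned by `cosine_similarity`.

```python
import numpy as np

# Hypothetical 4x4 similarity matrix standing in for the real one
sims = np.array([
    [1.00, 0.00, 0.12, 0.00],
    [0.00, 1.00, 0.04, 0.65],
    [0.12, 0.04, 1.00, 0.00],
    [0.00, 0.65, 0.00, 1.00],
])

# Keep only the strict upper triangle: k=1 zeroes the diagonal
# (self-similarity) and the mirrored lower half, so each pair appears once
upper = np.triu(sims, k=1)

# Locate the largest remaining score and convert it to (row, column)
i, j = np.unravel_index(np.argmax(upper), upper.shape)
print(f"Most similar pair: documents {i} and {j} (score {upper[i, j]:.2f})")
```

With the notebook's DataFrame, the positions `i` and `j` could be mapped back to tweets through the "Tweet Id" labels on the index.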