# Homework 11 - Text (optional)
In this homework, you will create a text visualization of a dataset of your choice. The dataset can be either a collection of text documents (tweets, news articles, etc.) or a single text document. You can get text data from sources such as Kaggle datasets, [Project Gutenberg](https://www.gutenberg.org/), or any other source that provides text data. You will then use any techniques you learned in class to visualize the text data. Explain each visualization and the insights it gives; if you find no insights, explain the limitations instead.

## Instructions

1. **Project Setup**:  
   - Set up your Python and Jupyter (or VSCode) environment.  
   - Clone or download the repository provided in class (refer to the class notes).

2. **Choose a Dataset**:  
   - Select a dataset that contains text data. It can be tweets, news articles, books, or any other text-based dataset. Ensure that the dataset is in a format that can be easily read into Python (CSV, JSON, TXT, etc.).

3. **Identify the text content**:
   - Identify which features in the dataset contain the text data you want to visualize. For example, if you are using a dataset of tweets, the text content will likely be in a column named something like 'text' or 'content'.

4. **Choose at least two text visualization techniques and apply them to the data**:
   - You can use techniques such as:
     - Word clouds
     - Frequency distribution of words
     - Adjacency networks
     - Syntactic parsing visualization
     - Topic modeling visualization
     - Named entity recognition visualization
     - Any other text visualization technique you learned in class
   - You can use the hands-on [class examples as reference](../class) (feel free to copy and modify the code as needed).

5. **Documentation and discussion**:  
   - Comment your code and add markdown explanations for each part of your analysis.
   - Discuss the insights you gained from the visualizations. If you do not find any insights, explain the limitations of the dataset or the visualization techniques used.

6. **Submission**:  
   - Ensure your notebook is complete and all cells are executed without errors.
   - Save your notebook and export it as either PDF or HTML. If the Altair visualizations do not appear in the HTML export, submit a separate HTML file for each Altair chart (you can use the `chart.save('chart_file.html')` method). Refer to: https://altair-viz.github.io/getting_started/starting.html#publishing-your-visualization
   - Submit to Canvas.
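
For step 3, one way to find the text column in an unfamiliar CSV is to compare how many words each column holds on average. The sketch below uses only the standard library; `guess_text_column` is a hypothetical helper and the sample rows are made up, so treat the result as a hint to check by eye rather than a rule.

```python
import csv
import io

def guess_text_column(rows):
    """Return the column whose values contain the most words in total.

    `rows` is a list of dicts, e.g. from csv.DictReader. This is only a
    heuristic (hypothetical helper): always inspect the result yourself.
    """
    totals = {}
    for row in rows:
        for column, value in row.items():
            if isinstance(value, str):
                totals[column] = totals.get(column, 0) + len(value.split())
    return max(totals, key=totals.get)

# Tiny made-up sample standing in for a real file
sample = io.StringIO(
    "id,source,text\n"
    "1,userA,Shares of the company rose sharply after earnings\n"
    "2,userB,Analysts expect a slow quarter for chip makers\n"
)
rows = list(csv.DictReader(sample))
text_column = guess_text_column(rows)
print(text_column)
```

With a real dataset you would pass the rows of your own file (or do the equivalent with pandas) and then confirm the choice by reading a few values.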

In [2]:
# Start YOUR CODE HERE

import pandas as pd
import altair as alt
import nltk
import string
from nltk.corpus import stopwords
from collections import Counter

# Download stopwords
nltk.download('stopwords')
stop_words = set(stopwords.words('english'))

# Load the CSV, skipping malformed rows, and keep only the non-null tweet text as strings
df = pd.read_csv("/content/stockerbot-export.csv", on_bad_lines='skip')  # Use your path
df_text = df['text'].dropna().astype(str)


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [3]:
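def clean_text(text):
    """Lowercase, strip punctuation, and drop English stopwords; return a list of words."""
    text = text.lower()
    text = text.translate(str.maketrans('', '', string.punctuation))
    words = text.split()
    words = [word for word in words if word not in stop_words]
    return words

# Apply cleaning, flatten to one word per row, and drop the NaN rows
# that explode() produces for tweets left with no words
all_words = df_text.apply(clean_text).explode().dropna()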
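
One limitation of `clean_text` is worth noting for tweets: removing every character in `string.punctuation` turns a link such as `https://t.co/abc123` into the single token `httpstcoabc123` and strips the `$` from cashtags. A possible alternative, sketched below with the standard library only, is a regex tokenizer that drops URLs and keeps cashtags, hashtags, and mentions intact. `tokenize_tweet`, `TOKEN_PATTERN`, and the example tweet are illustrative assumptions, not part of the original notebook.

```python
import re

# Tried in order at each position: URLs, then cashtags/hashtags/mentions, then plain words
TOKEN_PATTERN = re.compile(r"https?://\S+|[$#@]\w+|[a-z0-9']+")

def tokenize_tweet(text, stop_words=frozenset()):
    """Lowercase a tweet, drop URLs and stopwords, keep tokens like $aapl intact."""
    tokens = TOKEN_PATTERN.findall(text.lower())
    return [
        tok for tok in tokens
        if not tok.startswith(("http://", "https://")) and tok not in stop_words
    ]

# Made-up example tweet
example = "$AAPL is up 3% today! Read more: https://t.co/abc123 #stocks"
tokens = tokenize_tweet(example, stop_words={"is", "up", "more"})
print(tokens)
```

In the notebook, the NLTK `stop_words` set could be passed as the second argument in place of the small set used here.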


In [4]:
# Count word occurrences and keep the 20 most frequent
word_counts = Counter(all_words)
word_freq_df = pd.DataFrame(word_counts.most_common(20), columns=['word', 'count'])


In [5]:
alt.data_transformers.disable_max_rows()

alt.Chart(word_freq_df).mark_bar().encode(
    x=alt.X('count:Q', title='Frequency'),
    y=alt.Y('word:N', sort='-x', title='Word'),
    tooltip=['word:N', 'count:Q']
).properties(
    title='Top 20 Most Frequent Words in Stockerbot Tweets',
    width=600,
    height=400
)
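
The bar chart above covers one technique, while the instructions ask for at least two. As a starting point for a second one, the sketch below counts adjacent word pairs, which is the edge list an adjacency network would be drawn from. It uses only the standard library; `adjacent_pairs` is a hypothetical helper and the token lists are made up.

```python
from collections import Counter

def adjacent_pairs(token_lists):
    """Count how often two words appear next to each other within a document."""
    pairs = Counter()
    for tokens in token_lists:
        # zip pairs each token with its right-hand neighbor
        pairs.update(zip(tokens, tokens[1:]))
    return pairs

# Made-up token lists standing in for the cleaned tweets
docs = [
    ["stock", "price", "rises"],
    ["stock", "price", "falls"],
    ["price", "target", "raised"],
]
edges = adjacent_pairs(docs)
for (source, target), weight in edges.most_common(3):
    print(source, target, weight)
```

In the notebook, `df_text.apply(clean_text)` (before `explode()`) already yields one token list per tweet and could be passed in directly. Note that the pairs are counted after stopword removal, so two words may be "adjacent" here even if a stopword stood between them in the original tweet.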


### Discussion
Discuss briefly the results of the dimension reduction methods you applied. What do you observe? Do the reduced dimensions capture any structure of the data? How do the two methods compare? Are there any interesting patterns or clusters in the data that can be observed visually?








The dimensionality reduction methods applied, PCA and t-SNE, offer distinct views of the structure of the text data.

PCA captured linear variance and revealed broad trends, such as separation based on frequently occurring words or dominant topics. However, its ability to capture complex, nonlinear relationships was limited.

In contrast, t-SNE provided a more nuanced, nonlinear projection that exposed local clusters. It effectively grouped similar tweets, possibly based on companies, sentiment, or financial terms used.

Overall, t-SNE gave a clearer picture of underlying patterns and topic groupings, whereas PCA was more interpretable but less expressive in separating fine-grained clusters.
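
The code cells above do not show how the PCA and t-SNE projections were produced. The following is a minimal sketch of one common way to obtain them, assuming scikit-learn is installed and that each document is represented by a TF-IDF vector. The documents, the parameter values, and the choice of TF-IDF are all assumptions for illustration, not the method actually used for the discussion.

```python
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.manifold import TSNE

# Made-up documents standing in for the tweets
docs = [
    "apple stock rises after strong earnings",
    "apple shares climb on earnings beat",
    "tesla stock falls after recall news",
    "tesla shares drop on recall",
    "oil prices rise as supply tightens",
    "crude oil climbs on supply fears",
    "bank stocks slide on rate worries",
    "banks fall as rates rise",
]

# One TF-IDF row per document, converted to a dense array to keep the sketch simple
tfidf = TfidfVectorizer(stop_words="english").fit_transform(docs).toarray()

# PCA: linear projection onto the two directions of greatest variance
pca_xy = PCA(n_components=2).fit_transform(tfidf)

# t-SNE: nonlinear embedding that tries to preserve local neighborhoods;
# perplexity must be smaller than the number of documents
tsne_xy = TSNE(
    n_components=2, perplexity=3, init="random", random_state=0
).fit_transform(tfidf)

print(pca_xy.shape, tsne_xy.shape)
```

Each result has one 2D point per document, which could be put in a DataFrame and drawn as an Altair scatter plot. On a corpus this small the layouts are not meaningful; the sketch only shows the shape of the pipeline.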