### Student Information
Name: Ric Chen

Student ID: 110506025

GitHub ID: uilfl

Kaggle name: ricdatadog

Kaggle notebook: https://www.kaggle.com/code/ricdatadog/newmodel

Kaggle private scoreboard snapshot:
![snapshot](DM2024-Lab2-Homework/Screenshot.png)


---

### Instructions

1. First: __This part is worth 30% of your grade.__ Do the **take home exercises** in the [DM2024-Lab2-master Repo](https://github.com/didiersalazar/DM2024-Lab2-Master). You may need to copy some cells from the Lab notebook to this notebook. 


2. Second: __This part is worth 30% of your grade.__ Participate in the in-class [Kaggle Competition](https://www.kaggle.com/competitions/dm-2024-isa-5810-lab-2-homework) on Emotion Recognition on Twitter. Scoring is based on your place in the Private Leaderboard ranking:
    - **Bottom 40%**: Get 20% of the 30% available for this section.

    - **Top 41% - 100%**: Get (0.6N + 1 - x) / (0.6N) * 10 + 20 points, where N is the total number of participants and x is your rank. (For example, if there are 100 participants and you rank 3rd, your score will be (0.6 * 100 + 1 - 3) / (0.6 * 100) * 10 + 20 = 29.67% out of 30%.)
    Submit your last submission **BEFORE the deadline (Nov. 26th, 11:59 pm, Tuesday)**. Make sure to take a screenshot of your position at the end of the competition, store it as `pic0.png` under the **img** folder of this repository, and rerun the cell **Student Information**.
    

3. Third: __This part is worth 30% of your grade.__ A report of your work developing the model for the competition (you can use code and comment on it). This report should include your preprocessing steps, your feature engineering steps, and an explanation of your model. You can also mention different things you tried and insights you gained.


4. Fourth: __This part is worth 10% of your grade.__ It's hard for us to follow if your code is messy :'(, so please **tidy up your notebook**.


Upload your files to your repository then submit the link to it on the corresponding e-learn assignment.

Make sure to commit and save your changes to your repository __BEFORE the deadline (Dec. 8th, 11:59 pm, Sunday)__. 

### Report: Model Development for Text Classification Competition

Introduction

This report outlines the development of a text classification model for the competition, incorporating various preprocessing techniques, feature engineering steps, and modeling approaches. The primary goal was to analyze and classify tweets based on their emotional tone or content. The report discusses the methodologies used, challenges encountered, and insights gained during the process.

Data Preprocessing

Effective data preprocessing is a cornerstone of any text-based machine learning project. Several steps were implemented to ensure the dataset was clean and ready for modeling.

Steps Taken:
	1.	Handling Missing Values:
	•	Removed rows where critical columns like text or emotion had missing values.
	2.	Tokenization:
	•	Tweets were tokenized into individual words using nltk.
	3.	Stopword Removal:
	•	Standard stopwords were removed to reduce noise.
	•	Identified the need to expand the stopwords list with domain-specific terms in future iterations.
	4.	Special Character Cleaning:
	•	Removed Twitter-specific symbols (@, #, URLs) and other non-alphanumeric characters.
	•	Retained only the textual content necessary for emotion classification.
	5.	Label Encoding:
	•	Encoded emotion labels into integers using LabelEncoder.
	6.	Word Embeddings:
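	•	Used Word2Vec embeddings to represent tokens as dense vectors. In the code summary the model is trained on the tokenized tweets themselves (100-dimensional vectors) rather than loaded from pretrained weights. Sentence-level embeddings were computed as the mean of the word vectors.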
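
The cleaning, tokenization, stopword-removal, and label-encoding steps above can be sketched roughly as follows. This is a minimal illustration rather than the notebook's actual code: it uses only the standard library, the whitespace tokenizer stands in for the nltk tokenizer, the small `STOPWORDS` set stands in for a full English stopword list, and `encode_labels` mimics what LabelEncoder does (sorted classes mapped to integers). The function names and sample tweets are hypothetical, and dropping whole @mentions while keeping hashtag words is an assumption.

```python
import re

# Tiny stand-in for a full English stopword list (assumption for illustration).
STOPWORDS = {"a", "an", "the", "is", "am", "are", "i", "to", "of", "and", "so", "this"}

URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\w+")
NON_ALNUM_RE = re.compile(r"[^a-z0-9\s]")


def clean_tweet(text):
    """Lowercase, drop URLs and @mentions, then strip non-alphanumeric characters."""
    text = text.lower()
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    # Only the '#' symbol is removed, so '#joy' becomes 'joy' (assumption).
    return NON_ALNUM_RE.sub(" ", text)


def tokenize(text):
    """Clean the tweet, split on whitespace, and filter out stopwords."""
    return [tok for tok in clean_tweet(text).split() if tok not in STOPWORDS]


def encode_labels(labels):
    """Map each distinct label to an integer, using sorted class order."""
    classes = sorted(set(labels))
    mapping = {label: idx for idx, label in enumerate(classes)}
    return [mapping[label] for label in labels], classes


tweets = ["@user I am SO happy today!! #joy https://t.co/abc", "This is the worst day..."]
labels = ["joy", "anger"]

tokens = [tokenize(t) for t in tweets]
encoded, classes = encode_labels(labels)
print(tokens)
print(encoded, classes)
```

Keeping the cleaning rules in one function makes it easier to apply exactly the same transformation to the training and test tweets.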

Feature Engineering

To enhance the dataset’s representation, multiple feature engineering approaches were employed:
	1.	TF-IDF Vectorization:
	•	Initially used TF-IDF to transform text into sparse numerical vectors.
	•	Provided a strong baseline for simple models, but TF-IDF treats each tweet as a bag of words and ignores word order and context.
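	2.	Word2Vec Embeddings:
	•	Integrated Word2Vec to generate dense vector representations of text that capture similarity between words. Each word has a single static vector, so these embeddings are not context-aware in the way transformer embeddings are.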
	•	These embeddings formed the basis for clustering and classification models.
	3.	Clustering for Insights:
	•	Applied DBSCAN to cluster tweets based on their embeddings.
	•	Visualized clusters using PCA to reduce dimensionality and identify patterns.
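
The clustering and visualization step can be illustrated with a small sketch. It assumes scikit-learn and NumPy are available, and it uses synthetic vectors in place of the real tweet embeddings; the `eps` and `min_samples` values are illustrative only and are not the settings used in the notebook.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Synthetic stand-in for sentence embeddings: two tight groups plus one outlier.
group_a = rng.normal(loc=0.0, scale=0.05, size=(20, 50))
group_b = rng.normal(loc=1.0, scale=0.05, size=(20, 50))
outlier = np.full((1, 50), 5.0)
embeddings = np.vstack([group_a, group_b, outlier])

# eps and min_samples are illustrative values, not tuned settings from the report.
labels = DBSCAN(eps=1.0, min_samples=5).fit_predict(embeddings)

# Project to 2-D for plotting; DBSCAN marks noise points with the label -1.
points_2d = PCA(n_components=2).fit_transform(embeddings)

n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = int(np.sum(labels == -1))
print(n_clusters, n_noise, points_2d.shape)
```

The 2-D points can then be passed to a scatter plot colored by `labels`; points labeled -1 are the ones DBSCAN treats as noise.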

Model Development

Deep Learning Approach
	•	Developed a deep learning model to classify text based on TF-IDF features.
	•	Key characteristics:
		•	Integrated an embedding layer to process text features.
		•	Trained the model for multiple epochs, observing convergence and overfitting trends.
	•	Results:
		•	Achieved approximately 50% accuracy, indicating the need for architectural refinement and enhanced feature representation.

Classical Machine Learning Models

To complement the deep learning approach, traditional models were also explored:
	1.	Random Forest Classifier:
	•	Trained on Word2Vec embeddings.
	•	Delivered competitive accuracy with an interpretable decision-making process.
	2.	Naive Bayes Classifier:
	•	Applied to TF-IDF features.
	•	Performed well as an initial baseline but struggled to separate closely related emotion classes.
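
A TF-IDF plus Naive Bayes baseline of the kind described above might look like the sketch below. It assumes scikit-learn is installed; the toy texts and labels are hypothetical stand-ins for the competition tweets, and default vectorizer settings are used because the report does not state the actual ones.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy training data (hypothetical); the real notebook used the competition tweets.
train_texts = [
    "so happy and excited today",
    "what a wonderful happy surprise",
    "i am furious and angry",
    "this makes me so angry",
]
train_labels = ["joy", "joy", "anger", "anger"]

# TfidfVectorizer produces the sparse features; MultinomialNB classifies them.
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(train_texts, train_labels)

preds = model.predict(["happy happy day", "angry and furious"])
print(list(preds))
```

Wrapping both steps in one pipeline ensures the vectorizer is fitted on the training texts only and reused unchanged at prediction time.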

Insights and Challenges

Insights:
	•	Word embeddings provided a richer feature representation compared to TF-IDF.
	•	Clustering revealed underlying patterns and the presence of noise in the dataset.
	•	Different models highlighted the trade-off between complexity and interpretability.

Challenges:
	•	Handling imbalanced classes in the dataset.
	•	Matching the submission format required by the competition (e.g., producing exactly 411,972 prediction rows).
	•	Fine-tuning hyperparameters for optimal performance across diverse models.
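
The row-count requirement can be guarded with a check before the submission file is written. The sketch below is one possible approach under stated assumptions, not the method used in the notebook: `build_submission` and its parameters are hypothetical, and filling missing predictions with a fallback label is an assumption about how dropped rows might be handled.

```python
def build_submission(test_ids, predictions_by_id, expected_rows, fallback_label):
    """Return (id, emotion) rows covering every test id exactly once.

    Ids with no prediction (for example, rows dropped during cleaning) get
    `fallback_label`, so the output length always matches the test set.
    """
    if len(set(test_ids)) != len(test_ids):
        raise ValueError("test ids must be unique")
    rows = [(tid, predictions_by_id.get(tid, fallback_label)) for tid in test_ids]
    if len(rows) != expected_rows:
        raise ValueError(f"expected {expected_rows} rows, got {len(rows)}")
    return rows


# Toy example: 'id3' lost its prediction during preprocessing.
test_ids = ["id1", "id2", "id3"]
predictions_by_id = {"id1": "joy", "id2": "anger"}
rows = build_submission(test_ids, predictions_by_id, expected_rows=3, fallback_label="joy")
print(rows)
```

For the competition, `expected_rows` would be set to 411,972, so a mismatch is caught before uploading rather than by the Kaggle scorer.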

Recommendations for Future Work
	1.	Enhanced Preprocessing:
	•	Develop a more comprehensive cleaning pipeline tailored to Twitter data.
	•	Expand the stopwords list with competition-specific terms.
	2.	Experiment with Pretrained Models:
	•	Leverage transformer-based embeddings (e.g., BERT, RoBERTa) for deeper contextual understanding.
	3.	Fine-Tune Deep Learning Models:
	•	Experiment with architectures like LSTM, GRU, and transformers.
	•	Use learning rate schedulers and regularization techniques to prevent overfitting.
	4.	Augment Data for Class Imbalance:
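	•	Apply oversampling techniques such as SMOTE to the numeric feature vectors (embeddings or TF-IDF), since SMOTE interpolates between feature vectors rather than raw text.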
	•	Generate synthetic text samples using NLP-based augmentation tools.
	5.	Hybrid Model Approach:
	•	Combine predictions from classical and deep learning models to balance interpretability and performance.
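
One simple way to combine predictions from several models is soft voting: average the class probabilities and take the most likely class. The sketch below assumes NumPy and uses made-up probability matrices; `soft_vote` is a hypothetical helper, and the report does not say which combination scheme would actually be used.

```python
import numpy as np


def soft_vote(prob_matrices, weights=None):
    """Average class-probability matrices from several models and pick the argmax.

    Each matrix has shape (n_samples, n_classes) with columns in the same class order.
    """
    stacked = np.stack(prob_matrices)  # shape: (n_models, n_samples, n_classes)
    averaged = np.average(stacked, axis=0, weights=weights)
    return averaged.argmax(axis=1), averaged


# Hypothetical probabilities for 3 tweets and 2 classes from two models.
rf_probs = np.array([[0.9, 0.1], [0.4, 0.6], [0.55, 0.45]])
nn_probs = np.array([[0.7, 0.3], [0.2, 0.8], [0.30, 0.70]])

labels, averaged = soft_vote([rf_probs, nn_probs])
print(labels)
```

Passing `weights` (one value per model) lets a stronger model count for more in the average.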

Conclusion

The project successfully demonstrated the integration of multiple preprocessing, feature engineering, and modeling techniques. While the current models provided a solid foundation, future iterations focusing on advanced embeddings, better preprocessing, and enhanced architectures will likely yield significant improvements. This iterative process showcases the potential of machine learning in text classification tasks and provides valuable lessons for similar challenges.



In [4]:
# Code Summary
# Summary of the code used to preprocess data, engineer features, and train models.
# Assumes `emotion_tweets` is a DataFrame loaded earlier with 'text' and 'emotion'
# columns, and that the NLTK tokenizer data has been downloaded.

import numpy as np
from gensim.models import Word2Vec
from nltk.tokenize import word_tokenize
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder


def get_sentence_embedding(tokens, model):
    # Mean of the word vectors, as described in the report. This helper was not
    # shown in the original summary, so its body is a reconstruction.
    vectors = [model.wv[t] for t in tokens if t in model.wv]
    return np.mean(vectors, axis=0) if vectors else np.zeros(model.vector_size)


# Load and preprocess the dataset (.copy() avoids pandas' SettingWithCopyWarning)
df = emotion_tweets.dropna(subset=['text', 'emotion']).copy()
df['tokens'] = df['text'].apply(word_tokenize)
label_encoder = LabelEncoder()  # kept so predictions can be mapped back to emotion names
df['emotion'] = label_encoder.fit_transform(df['emotion'])

# Generate Word2Vec embeddings (trained on the tweets themselves)
w2v_model = Word2Vec(sentences=df['tokens'].tolist(), vector_size=100, window=5, min_count=1, workers=4)
df['embedding'] = df['tokens'].apply(lambda x: get_sentence_embedding(x, w2v_model))

# Train-test split
X = np.array(df['embedding'].tolist())
y = df['emotion']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Random Forest
rf_model = RandomForestClassifier(random_state=42)
rf_model.fit(X_train, y_train)
rf_preds = rf_model.predict(X_test)

# Evaluate
print(classification_report(y_test, rf_preds))