### Student Information
Name: Sofiia Stankevich 妮雅

Student ID: S10910630

GitHub ID: stankevichhhh

Kaggle name: stankevichhhh

Kaggle private scoreboard snapshot:

---

### Instructions

1. First: __This part is worth 30% of your grade.__ Do the **take home exercises** in the [DM2024-Lab2-master Repo](https://github.com/didiersalazar/DM2024-Lab2-Master). You may need to copy some cells from the Lab notebook to this notebook. 


2. Second: __This part is worth 30% of your grade.__ Participate in the in-class [Kaggle Competition](https://www.kaggle.com/competitions/dm-2024-isa-5810-lab-2-homework) on Emotion Recognition on Twitter. Your score depends on your rank in the Private Leaderboard: 
    - **Bottom 40%**: Get 20% of the 30% available for this section.

    - **Top 41% - 100%**: Get (0.6N + 1 - x) / (0.6N) * 10 + 20 points, where N is the total number of participants and x is your rank. (For example, if there are 100 participants and you rank 3rd, your score is (0.6 * 100 + 1 - 3) / (0.6 * 100) * 10 + 20 = 29.67% out of 30%.)   
    Submit your last submission **BEFORE the deadline (Nov. 26th, 11:59 pm, Tuesday)**. Make sure to take a screenshot of your position at the end of the competition, store it as `pic0.png` under the **img** folder of this repository, and rerun the cell **Student Information**.
    

3. Third: __This part is worth 30% of your grade.__ A report of your work developing the model for the competition (you can use code and comment on it). This report should include your preprocessing steps, your feature engineering steps, and an explanation of your model. You can also mention different things you tried and insights you gained. 


4. Fourth: __This part is worth 10% of your grade.__ It's hard for us to follow if your code is messy :'(, so please **tidy up your notebook**.


Upload your files to your repository then submit the link to it on the corresponding e-learn assignment.

Make sure to commit and save your changes to your repository __BEFORE the deadline (Nov. 26th, 11:59 pm, Tuesday)__. 
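
The scoring rule for the Kaggle part can be sketched as a small function. The function name `kaggle_section_score` is hypothetical, and the boundary between the two tiers (rank at most 0.6N counts as the upper tier) is an assumption, since the instructions do not state where exactly the cutoff falls.

```python
def kaggle_section_score(rank: int, n_participants: int) -> float:
    """Points (out of 30) for the Kaggle part, following the rule quoted above.

    Assumption: a rank of at most 0.6 * N falls in the upper tier;
    every other rank receives the flat 20 points.
    """
    top_cutoff = 0.6 * n_participants
    if rank > top_cutoff:
        return 20.0  # lower tier: flat 20 points
    # upper tier: (0.6N + 1 - x) / (0.6N) * 10 + 20
    return (top_cutoff + 1 - rank) / top_cutoff * 10 + 20


# Worked example from the instructions: 100 participants, rank 3
print(round(kaggle_section_score(3, 100), 2))
```

Under this reading, rank 1 yields the full 30 points and the score decreases linearly toward 20 as the rank approaches 0.6N.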

## 1. Home exercises

https://github.com/stankevichhhh/DM2024-Lab2-Master/blob/main/DM2024-Lab2-Master.ipynb

## 2. Participation in the in-class Kaggle Competition
* 26th November screenshot:
 <img src="position_nov_26.png" />

## 3. Report of work developing the model for the competition

In [None]:
# importing required python libraries and modules
import json
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Conv1D, GlobalMaxPooling1D, Dense, Dropout
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
import os

In [None]:
# Loading the raw tweets: the file holds one JSON object per line
data = []
with open('/kaggle/input/dm-2024-isa-5810-lab-2-homework/tweets_DM.json', 'r') as f:
    for line in f:
        data.append(json.loads(line))
# no explicit f.close() is needed: the with block closes the file
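
A minimal sketch of the same loading step on in-memory data, useful for checking the record layout without the Kaggle files. The two records are made up for illustration; their nesting (`_source` → `tweet` → `tweet_id`, `hashtags`, `text`) is inferred from how the next cell accesses the fields, and the helper name `flatten_tweets` is hypothetical.

```python
import io
import json

# Two made-up records, one JSON object per line, mimicking the assumed layout.
raw_lines = io.StringIO(
    '{"_source": {"tweet": {"tweet_id": "0x1", "hashtags": ["joy"], "text": "What a day! #joy"}}}\n'
    '{"_source": {"tweet": {"tweet_id": "0x2", "hashtags": [], "text": "Stuck in traffic again"}}}\n'
)


def flatten_tweets(lines):
    """Parse JSON lines and keep only the tweet id, hashtags and text."""
    rows = []
    for line in lines:
        line = line.strip()
        if not line:  # tolerate blank lines
            continue
        tweet = json.loads(line)["_source"]["tweet"]
        rows.append({
            "tweet_id": tweet["tweet_id"],
            "hashtags": tweet["hashtags"],
            "text": tweet["text"],
        })
    return rows


rows = flatten_tweets(raw_lines)
print(len(rows), rows[0]["tweet_id"])
```

The resulting list of dictionaries can be passed directly to `pd.DataFrame(rows)`, which avoids the intermediate `_source` column.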

In [None]:
# Loading the emotion labels and the train/test identification table
emotion = pd.read_csv('/kaggle/input/dm-2024-isa-5810-lab-2-homework/emotion.csv')
data_identification = pd.read_csv('/kaggle/input/dm-2024-isa-5810-lab-2-homework/data_identification.csv')

# creating merged DataFrame with all needed information
df = pd.DataFrame(data)
_source = df['_source'].apply(lambda x: x['tweet'])
df = pd.DataFrame({
    'tweet_id': _source.apply(lambda x: x['tweet_id']),
    'hashtags': _source.apply(lambda x: x['hashtags']),
    'text': _source.apply(lambda x: x['text']),
})
df = df.merge(data_identification, on='tweet_id', how='left')

# Splitting data into train and test data
train_data = df[df['identification'] == 'train']
test_data = df[df['identification'] == 'test']
train_data = train_data.merge(emotion, on='tweet_id', how='left')
train_data.drop_duplicates(subset=['text'], keep=False, inplace=True)  # keep=False drops every tweet whose text occurs more than once

# Preparing training data
train_data_sample = train_data.sample(frac=0.3, random_state=42)  # working with a random 30% sample of the training data
y_train_data = train_data_sample['emotion']
X_train_data = train_data_sample['text']

# LabelEncoder maps each emotion string to an integer class id
le = LabelEncoder()
y_train = le.fit_transform(y_train_data)

# Holding out 20% of the sample for validation, stratified by emotion
X_train, X_val, y_train, y_val = train_test_split(X_train_data, y_train, test_size=0.2, random_state=42, stratify=y_train)

# Tokenizer restricted to the most frequent words in the training texts (word indices below 10000)
tokenizer = Tokenizer(num_words=10000)
tokenizer.fit_on_texts(X_train)
# Converting texts into sequences of word indices
X_train_seq = tokenizer.texts_to_sequences(X_train)
X_val_seq = tokenizer.texts_to_sequences(X_val)

# Zero-padding at the end; without maxlen, each call pads to the longest sequence in its own input
X_train_pad = pad_sequences(X_train_seq, padding='post')
X_val_pad = pad_sequences(X_val_seq, padding='post')

# Initializing the embedding matrix with random values (no pretrained vectors)
embedding_dim = 100  # length of the vector that represents each word
word_index = tokenizer.word_index  # word -> integer index learned by the tokenizer
vocab_size = min(len(word_index) + 1, 10000)  # number of embedding rows: index 0 is padding, capped at num_words
embedding_matrix = np.random.uniform(-1, 1, (vocab_size, embedding_dim))

# Initializing sequential model with 7 layers
model = Sequential([
    Embedding(input_dim=vocab_size, output_dim=embedding_dim, weights=[embedding_matrix], trainable=True),
    Conv1D(filters=128, kernel_size=5, activation='relu'),
    GlobalMaxPooling1D(),
    Dropout(0.5),
    Dense(64, activation='relu'),
    Dropout(0.5),
    Dense(len(le.classes_), activation='softmax')
])

#1. Embedding: converts integer word indices into dense vectors of fixed length (output_dim). The weights start from a random matrix (weights=[embedding_matrix]) and are updated during training (trainable=True).
#2. Conv1D: a convolution layer with 128 filters and a window of 5 consecutive tokens, with ReLU activation.
#3. GlobalMaxPooling1D: a global pooling layer that keeps the maximum value of each filter along the sequence (token position) axis.
#4. Dropout: a regularization layer that randomly zeroes 50% of its inputs during training to reduce overfitting.
#5. Dense: a fully connected layer with 64 units and ReLU activation.
#6. Dropout: another regularization layer with a 50% drop rate.
#7. Dense: the fully connected output layer with one unit per emotion class (len(le.classes_)) and softmax activation for multi-class classification.

# The model is compiled with the Adam optimizer, the sparse_categorical_crossentropy loss (suitable for multi-class classification with integer labels), and the accuracy metric.
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])

# Training the model
with tf.device('/gpu:0'):
    history = model.fit(
        X_train_pad, y_train,
        validation_data=(X_val_pad, y_val),
        batch_size=128,
        epochs=10,
        verbose=1
    )

# Tokenizing and padding the test tweets the same way, then predicting class probabilities
test_texts = test_data['text']
test_seq = tokenizer.texts_to_sequences(test_texts)
X_test_pad = pad_sequences(test_seq, padding='post')
y_test_pred = model.predict(X_test_pad)

# Converting predicted probabilities into emotion labels
y_test_pred_labels = le.inverse_transform(np.argmax(y_test_pred, axis=1))
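
The tokenization and padding steps in the cell above can be illustrated without Keras. The sketch below is a simplified approximation, not the Keras implementation: it only lowercases and splits on whitespace (no punctuation filtering), and it truncates from the end rather than from the start. The helper names `fit_word_index`, `texts_to_sequences` and `pad_post`, as well as the example texts, are hypothetical.

```python
from collections import Counter


def fit_word_index(texts):
    """Rank words by frequency: the most frequent word gets index 1."""
    counts = Counter(word for text in texts for word in text.lower().split())
    return {word: rank for rank, (word, _) in enumerate(counts.most_common(), start=1)}


def texts_to_sequences(texts, word_index, num_words):
    """Replace words by their index; unknown words and indices >= num_words are dropped."""
    return [
        [word_index[w] for w in text.lower().split()
         if w in word_index and word_index[w] < num_words]
        for text in texts
    ]


def pad_post(sequences, maxlen):
    """Append zeros (or cut the tail) so that every sequence has length maxlen."""
    return [(seq + [0] * maxlen)[:maxlen] for seq in sequences]


train_texts = ["so happy today", "so sad", "happy happy day"]
word_index = fit_word_index(train_texts)

# With num_words=4 only the words ranked 1..3 are kept.
sequences = texts_to_sequences(["so happy today", "so sad day"], word_index, num_words=4)
padded = pad_post(sequences, maxlen=4)
print(padded)
```

Index 0 is never assigned to a word, which is why it can serve as the padding value and why the embedding layer needs one row more than the number of kept words.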

In [None]:
# Saving predictions for submission
submission = pd.DataFrame({
    'id': test_data['tweet_id'],
    'emotion': y_test_pred_labels
})
submission.to_csv('/kaggle/working/submission.csv', index=False)
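
As a rough, dependency-free illustration of how the predicted probabilities end up as rows of the submission file, the sketch below picks the most probable class per tweet and writes `id,emotion` rows to an in-memory CSV. The class names, tweet ids and probabilities are made up, and the helper name `decode` is hypothetical.

```python
import csv
import io

class_names = ["anger", "joy", "sadness"]  # hypothetical classes, in the sorted order LabelEncoder uses
tweet_ids = ["0x1", "0x2"]                 # hypothetical ids
probabilities = [                          # one row of class probabilities per tweet
    [0.1, 0.7, 0.2],
    [0.5, 0.2, 0.3],
]


def decode(prob_rows, classes):
    """Return the name of the highest-probability class for each row (argmax)."""
    return [classes[max(range(len(row)), key=row.__getitem__)] for row in prob_rows]


labels = decode(probabilities, class_names)

# Write the same two-column layout as the submission file.
buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow(["id", "emotion"])
writer.writerows(zip(tweet_ids, labels))
print(buffer.getvalue())
```

Checking a handful of rows this way before uploading can catch mismatched ids or a wrong column order.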