# ML Course 2023 | Sentiment Analysis in Twitter Challenge
You can check the updated leaderboard in this [link](https://nimble-hellebore-184.notion.site/ML-Course-2023-Sentiment-Analysis-in-Twitter-Challenge-966b041e7aec4f2eabbc8dc33d64b871).

In [None]:
!pip3 install tueplots==0.0.5
!pip3 install sentence-transformers==2.2.2


In [None]:
!pip3 install WordCloud
!pip3 install seaborn
!pip3 install -U imbalanced-learn
!pip3 install xgboost


# Load Tweets

The dataframe of tweets contains the following columns:

- `id`: The unique identifier of the tweet
- `text`: The content of the tweet
- `type`: The type of tweet, which can be 'tweet', 'quoted', 'retweeted' or 'quoted__replied_to'
- `author_id`: The unique identifier of the author of the tweet
- `possibly_sensitive`: A boolean value indicating whether the tweet contains sensitive content
- `retweet_count`: The number of times the tweet has been retweeted
- `quote_count`: The number of times the tweet has been quoted
- `reply_count`: The number of times the tweet has been replied to
- `like_count`: The number of times the tweet has been liked
- `followers_count`: The number of followers of the author of the tweet
- `following_count`: The number of accounts the author of the tweet is following
- `tweet_count`: The total number of tweets made by the author of the tweet
- `listed_count`: The number of lists the author of the tweet is a member of
- `score_compound`: A numerical value ranging from -1 to 1 indicating the overall sentiment of the tweet, where -1 represents negative sentiment and 1 represents positive sentiment. **This is the target variable for the regression task.**
- `sentiment`: A categorical variable indicating the sentiment of the tweet, which can be 'negative', 'neutral' or 'positive'. **This is the target variable for the classification task.**




In [None]:
import os
import pandas as pd
pd.set_option('display.max_rows', 100)

from wordcloud import WordCloud

import matplotlib.pyplot as plt
import seaborn as sns
from tueplots import bundles

plt.rcParams.update(bundles.icml2022())
import tueplots.constants.color.palettes as tue_palettes

from sentence_transformers import SentenceTransformer

import matplotlib as mpl
mpl.rcParams.update(mpl.rcParamsDefault)

import numpy as np


In [None]:
team_id = '1' #put your team id here
split = 'test_1' # replace by 'test_2' for FINAL submission

df = pd.read_csv('tweets_train.csv') # we are subsampling 100 training data
df_test = pd.read_csv(f'tweets_{split}.csv') # we are subsampling 200 test data


In [None]:
df[df.type == 'tweet'].head()

In [None]:
df_test.head()

# Pre-process tweets

The following are the preprocessing steps we followed to get the `words` column from the original tweet, which corresponds to the `text` column of the dataframe.

- Remove punctuation, special characters, mentions, links, and numbers from the tweets.
- Convert all the tweets to lowercase.
- Tokenize the tweets into individual words.
- Remove stop words, such as "and", "the", "a", etc.
- Perform stemming or lemmatization on the remaining words to convert them to their base form.
- Filter out any words that occur infrequently in the corpus to reduce the dimensionality of the data.
- Create a bag of words representation of the tweets, where each tweet is represented as a vector of word frequencies.


**Note:** Lemmatization is a process in natural language processing where words are reduced to their base form, or lemma. This is done by removing inflections, such as pluralization or verb conjugation, and converting the word to its dictionary form. The result of this process is a word that is more easily recognizable, and can be used to improve the accuracy of NLP models, such as the LDA model. By lemmatizing the words in a corpus of text, the dimensionality of the data is reduced, and the relationships between words become clearer, making it easier to identify patterns and themes within the text.
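
The first steps of the list above can be sketched with the standard library alone. This is a minimal illustration, not the pipeline that produced the `words` column: the stop-word list is a tiny hypothetical one, the helper names `preprocess` and `drop_rare` are made up for this example, and stemming/lemmatization is left out because it needs an external library.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (assumption); real pipelines use a fuller list.
STOP_WORDS = {"a", "an", "and", "the", "is", "to", "of", "in", "this", "it"}

def preprocess(text):
    """Lowercase, strip links/mentions/non-letters, tokenize, drop stop words."""
    text = text.lower()
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)  # links
    text = re.sub(r"@\w+", " ", text)                    # mentions
    text = re.sub(r"[^a-z\s]", " ", text)                # punctuation, digits, symbols
    return [tok for tok in text.split() if tok not in STOP_WORDS]

def drop_rare(token_lists, min_count=2):
    """Remove tokens that occur fewer than min_count times in the whole corpus."""
    counts = Counter(tok for toks in token_lists for tok in toks)
    return [[t for t in toks if counts[t] >= min_count] for toks in token_lists]

tweets = [
    "@user I LOVE this phone!!! https://t.co/abc 100%",
    "The phone is great, love it",
]
tokens = [preprocess(t) for t in tweets]
print(tokens)
print(drop_rare(tokens))
```

On a real corpus the `min_count` threshold would be tuned, since it directly controls the vocabulary size.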


In [None]:
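import ast
# `words` stores each token list as a string; literal_eval parses it without executing arbitrary code
df['words_str'] = df['words'].apply(lambda words: ' '.join(ast.literal_eval(words)))
df_test['words_str'] = df_test['words'].apply(lambda words: ' '.join(ast.literal_eval(words)))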


In [None]:
df['words'].head()

In [None]:
df['words_str'].head()

# Preprocessing data

The following steps reduce the number of variables by dropping the text and categorical columns, and standardize the remaining numeric values.

In [None]:
df_wo_label = df.drop(['sentiment', 'words', 'words_str', 'text', 'type', 'possibly_sensitive'], axis = 1)

In [None]:
df_test_wo_label = df_test.drop(['words', 'words_str', 'text', 'type', 'possibly_sensitive'], axis = 1)

In [None]:
from sklearn.decomposition import PCA
from sklearn import preprocessing

In [None]:
# it would be better to fit the scaler each time we train and validate the model
df_norm = preprocessing.StandardScaler().fit_transform(df_wo_label)

# Visualize content of the tweets

Join all of the preprocessed tweets together and create a word cloud from them to see the most frequently used words across all tweets.

In [None]:
long_string = ','.join(list(df['words_str'].values))
# Create a WordCloud object
wordcloud = WordCloud(font_path = 'C:/Users/bodhi/AppData/Local/Microsoft/Windows/Fonts/times.ttf', background_color="white", max_words=5000, contour_width=3, contour_color='steelblue')
# Generate a word cloud
wordcloud.generate(long_string)
# Visualize the word cloud
wordcloud.to_image()

# Sentiment Analysis

**In this part, we will visualize the distribution of these possible sentiments in our dataset.**

Each tweet in our dataset has one of three sentiments (`sentiment`):

*   Positive
*   Neutral
*   Negative

Also, each tweet has a continuous score (`score_compound`) in [-1, 1], where -1 corresponds to a negative and 1 to a positive sentiment.

In [None]:
df_pos = df[df.sentiment == 'positive']
df_neu = df[df.sentiment == 'neutral']
df_neg = df[df.sentiment == 'negative']


num_total = len(df)
num_pos = len(df_pos)
num_neu = len(df_neu)
num_neg = len(df_neg)

print(f"Num. positive tweets: {num_pos} ({num_pos/num_total*100:.2f}%)")
print(f"Num. negative tweets: {num_neg} ({num_neg/num_total*100:.2f}%)")
print(f"Num. neutral tweets: {num_neu} ({num_neu/num_total*100:.2f}%)")


In [None]:
plt.close('all')

fig, ax = plt.subplots(1, 1)

labels = []
labels.append(f"Positive ({num_pos/num_total*100:.2f}%)")
labels.append(f"Neutral ({num_neu/num_total*100:.2f}%)")
labels.append(f"Negative ({num_neg/num_total*100:.2f}%)")

sizes = [num_pos, num_neu, num_neg]

colors = [f"#{i}" for i in tue_palettes.high_contrast[:3]]

_ = ax.pie(sizes,colors=colors, startangle=90)
# plt.style.use('default')
ax.legend(labels,
          loc='upper center', 
          bbox_to_anchor=(1.23, 1.0), 
          fancybox=True, 
          shadow=True)

ax.set_title("Sentiment Analysis")
plt.tight_layout()
plt.show()

In [None]:
plt.close('all')

sns.countplot(x=df.sentiment, palette=colors)
plt.show()

In [None]:
plt.close('all')

sns.violinplot(data=df, x='sentiment', y='score_compound', palette=colors)
plt.show()

# Check the correlation between data and target

In [None]:
from scipy.stats import shapiro

In [None]:
df_norm = pd.DataFrame(np.squeeze(df_norm), columns=list(df_wo_label.columns))

In [None]:
df_norm

In [None]:
# using spearman because the data is not normally distributed
correlation = df_norm.corr('spearman')

In [None]:
plt.figure(figsize=(10,5), dpi =100)
sns.heatmap(correlation,annot=True,fmt=".2f", linewidth=.5)
plt.show()

The correlation between each feature and `score_compound` is not very strong, but these features may still improve our results slightly later on.
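
As a small aside on why a rank-based measure was chosen: Spearman correlation is the Pearson correlation of the ranks, so it captures any monotone relationship, not only linear ones. The sketch below uses NumPy on toy data; the helper names are hypothetical and ties are not handled (pandas handles them with average ranks).

```python
import numpy as np

def ranks(values):
    """Rank values from 0 to n-1 (assumes no ties, which holds for the toy data)."""
    order = np.argsort(values)
    r = np.empty(len(values), dtype=float)
    r[order] = np.arange(len(values))
    return r

def pearson(a, b):
    return float(np.corrcoef(a, b)[0, 1])

def spearman(a, b):
    # Spearman = Pearson computed on the ranks
    return pearson(ranks(a), ranks(b))

x = np.arange(1, 11, dtype=float)
y = np.exp(x)  # monotone but strongly non-linear in x

print(f"Pearson:  {pearson(x, y):.3f}")
print(f"Spearman: {spearman(x, y):.3f}")
```

For skewed count features such as `followers_count` or `like_count`, this robustness to non-linearity is the main reason to prefer the rank-based version.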

# Using PCA

In [None]:
from sklearn.decomposition import PCA, FastICA

In [None]:
df_pca = PCA(n_components = 0.95, whiten = True).fit(df_norm.drop(['score_compound'], axis = 1))

In [None]:
df_pca.explained_variance_ratio_

To retain 95% of the variance we need 9 components, which is only one fewer than the total number of features, so I do not think PCA will be useful here.
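
To make the reasoning concrete, here is a hedged NumPy sketch of how the number of components for a given variance target can be computed, and why it is close to the feature count when features are nearly uncorrelated. The data is synthetic and the function name `n_components_for` is an assumption of this example, not part of scikit-learn.

```python
import numpy as np

def n_components_for(X, target=0.95):
    """Smallest number of principal components whose cumulative
    explained-variance ratio reaches `target`."""
    Xc = X - X.mean(axis=0)                    # centre the data
    s = np.linalg.svd(Xc, compute_uv=False)    # singular values, descending
    ratio = s**2 / np.sum(s**2)                # explained-variance ratio
    cumulative = np.cumsum(ratio)
    return int(np.searchsorted(cumulative, target) + 1), ratio

rng = np.random.default_rng(0)

# Ten roughly independent features: almost every component is needed.
X_indep = rng.normal(size=(500, 10))
k_indep, _ = n_components_for(X_indep)

# Ten features driven by two latent factors: two components suffice.
latent = rng.normal(size=(500, 2))
X_corr = latent @ rng.normal(size=(2, 10)) + 0.01 * rng.normal(size=(500, 10))
k_corr, _ = n_components_for(X_corr)

print(k_indep, k_corr)
```

The tweet metadata behaves like the first case, which matches the observation that PCA removes only one dimension.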

# PLOT that may be useful

# Obtain the text embeddings

When working with natural language processing tasks, such as text classification, it is common to use word embeddings to represent the meaning of words and sentences. Word embeddings are dense vectors that capture the semantic relationships between words in a way that allows for easier processing by machine learning algorithms.

The process of creating word embeddings involves training a neural network on a large corpus of text data. However, pre-trained word embeddings are readily available online and can be downloaded and used in your projects. See a complete list of pre-trained models [here](https://github.com/UKPLab/sentence-transformers/blob/master/docs/pretrained-models/sts-models.md). 


**Note:** When working with pre-trained models, it is important to keep in mind the computational resources required to generate the embeddings. Depending on the size of the model and the amount of text data being processed, generating embeddings may take a significant amount of time. Therefore, it is advisable to save the embeddings locally once they have been generated, to avoid re-generating them every time you change the model (but not the embedding).
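
One possible way to follow this advice is a small load-or-compute helper. The sketch below is an illustration only: `cached_embeddings` and `fake_encode` are hypothetical names, and `fake_encode` stands in for the real `model.encode` call so the example runs without downloading a model.

```python
import os
import tempfile
import numpy as np

def cached_embeddings(path, sentences, encode_fn):
    """Load embeddings from `path` if the file exists; otherwise compute
    them with `encode_fn`, save them to `path`, and return them."""
    if os.path.exists(path):
        return np.load(path)
    embeddings = np.asarray(encode_fn(sentences))
    np.save(path, embeddings)
    return embeddings

# Stand-in encoder (hypothetical); replace with the SentenceTransformer's encode method.
calls = []
def fake_encode(sentences):
    calls.append(len(sentences))
    return np.array([[len(s), s.count(" ")] for s in sentences], dtype=float)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "embeddings.npy")
    first = cached_embeddings(path, ["good day", "bad"], fake_encode)
    second = cached_embeddings(path, ["good day", "bad"], fake_encode)  # served from disk
    print(first.shape, len(calls))
```

Note that this simple cache does not detect a changed input: if the sentences or the model change, the file has to be deleted or given a new name.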



In [None]:
# name = 'stsb-distilbert-base'
# name = 'all-mpnet-base-v2'
name = 'stsb-mpnet-base-v2'
# name = 'bert-base-nli-mean-tokens'
# name = 'average_word_embeddings_komninos'
model = SentenceTransformer(name)


In [None]:
sentences = list(df.words_str.values)
sentence_embeddings = model.encode(sentences)
np.save('stsb_mpnet_base_v2_embeddings_all.npy', sentence_embeddings)

In [None]:
sentence_embeddings = np.load("stsb_mpnet_base_v2_embeddings_all.npy")

In [None]:
sentence_embeddings.shape

Aside from the sentence embeddings, we can also vectorize the words with other NLP methods: the bag-of-words (BoW) model and n-grams.

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
import pickle

In [None]:
count_vector = {
    'words11g.pkl': None,
    'words11g100.pkl': None,
    'words11g500.pkl': None,
    'words11g1000.pkl': None,
    'words11g2500.pkl': None,
    'words11g5000.pkl': None,
    'words22g100.pkl' : None,
    'words22g500.pkl' : None,
    'words22g1000.pkl' : None,
    'words22g2500.pkl' : None,
    'words22g5000.pkl' : None,
    'words22g10000.pkl' : None,
    'words22g20000.pkl' : None,
    'words22g25000.pkl' : None,
    'words22g30000.pkl' : None,
    'words22g40000.pkl' : None,
    'words22g50000.pkl' : None,
    'words22g60000.pkl' : None,
    'words22g70000.pkl' : None,
    'words22g80000.pkl' : None,
    'words33g.pkl' : None,
    'words33g100.pkl' : None,
    'words33g500.pkl' : None,
    'words33g2500.pkl' : None,
    'words33g1000.pkl' : None,
    'words33g5000.pkl' : None,
    'words33g10000.pkl' : None,
    'words33g25000.pkl' : None,
    'words33g50000.pkl' : None,
    'words33g75000.pkl' : None,
    'words12g.pkl' : None,
    'words12g100.pkl' : None,
    'words12g500.pkl' : None,
    'words12g1000.pkl' : None,
    'words12g2500.pkl' : None,
    'words12g5000.pkl' : None,
    'words12g10000.pkl' : None,
    'words12g25000.pkl' : None,
    'words12g50000.pkl' : None,
    'words12g75000.pkl' : None,
    'words23g.pkl' : None,
    'words23g100.pkl' : None,
    'words23g500.pkl' : None,
    'words23g1000.pkl' : None,
    'words23g2500.pkl' : None,
    'words23g5000.pkl' : None,
    'words23g10000.pkl' : None,
    'words23g25000.pkl' : None,
    'words23g50000.pkl' : None,
    'words23g75000.pkl' : None,
    'words23g100000.pkl' : None,
    'words23g125000.pkl' : None,
    'words23g150000.pkl' : None,
    'words13g.pkl' : None,
    'words13g100.pkl' : None,
    'words13g500.pkl' : None,
    'words13g1000.pkl' : None,
    'words13g2500.pkl' : None,
    'words13g5000.pkl' : None,
    'words13g10000.pkl' : None,
    'words13g25000.pkl' : None,
    'words13g50000.pkl' : None,
    'words13g75000.pkl' : None,
    'words13g100000.pkl' : None,
    'words13g125000.pkl' : None,
    'words13g150000.pkl' : None,
    

}


We will analyze these representations. First, we make a few comparisons between BoW with 1-grams, 2-grams, 3-grams, and the unions of these ranges; we will also look at the top 10% of words.
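
To illustrate what an n-gram range and a feature cap mean, here is a plain-Python sketch of a bag of n-grams. It is a simplification and not a reimplementation of `CountVectorizer`: tokens are split on whitespace, ties between equally frequent n-grams are broken by first appearance, and the function names are hypothetical.

```python
from collections import Counter

def ngrams(tokens, n_min, n_max):
    """All contiguous n-grams of the token list for n in [n_min, n_max]."""
    out = []
    for n in range(n_min, n_max + 1):
        out.extend(" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return out

def bag_of_ngrams(docs, n_min=1, n_max=1, max_features=None):
    """Return (vocabulary, count rows); keeps only the most frequent
    `max_features` n-grams when a cap is given."""
    tokenized = [ngrams(d.split(), n_min, n_max) for d in docs]
    totals = Counter(g for doc in tokenized for g in doc)
    vocab = sorted(g for g, _ in totals.most_common(max_features))
    rows = []
    for doc in tokenized:
        doc_counts = Counter(doc)
        rows.append([doc_counts[g] for g in vocab])
    return vocab, rows

docs = ["not good", "good phone", "not bad phone"]

vocab_1, rows_1 = bag_of_ngrams(docs, 1, 1)          # unigrams only
vocab_12, rows_12 = bag_of_ngrams(docs, 1, 2)        # unigrams and bigrams
vocab_top, _ = bag_of_ngrams(docs, 1, 1, max_features=2)

print(vocab_1, rows_1[0])
print(vocab_12)
print(vocab_top)
```

Bigrams such as "not good" keep negation attached to the word it modifies, which is one reason the combined ranges are worth comparing for sentiment.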

In [None]:
count_vector

This has more than 16k unique words, so I think we will analyze the top 100, 500, 1000, 2500, and 5000 words.

In [None]:

def save_pickle(file_name):
    os.makedirs('count_vectorizer', exist_ok=True)  # make sure the cache folder exists
    with open('count_vectorizer/' + file_name, 'wb') as f:
        pickle.dump(count_vector[file_name], f)

def load_pickle(file_name):
    with open('count_vectorizer/' + file_name, 'rb') as f:
        count_vector[file_name] = pickle.load(f)


for key in count_vector.keys():
    try:
        load_pickle(key)
    except FileNotFoundError:
        # Key format: 'words<min_n><max_n>g<max_features>.pkl'; an empty <max_features> means no limit
        ngram = (int(key[5]), int(key[6]))
        max_feature = key[key.find('g') + 1:key.find('.')]
        vectorizer = CountVectorizer(ngram_range=ngram,
                                     max_features=int(max_feature) if max_feature else None)
        count_vector[key] = vectorizer.fit_transform(df['words_str'])
        save_pickle(key)
        

In [None]:
count_vector

We see that we have a lot of unique words. This requires a lot of computational power, so we will also analyze restricted versions that keep roughly 10% or 25% of them.

We will use linear regression to compare these representations.

In [None]:
import sklearn.metrics as skm

def test_model(strategy, model, X, y):
    """Return the average RMSE of `model` over the folds produced by `strategy`."""
    y = np.asarray(y)  # positional indexing, independent of the pandas index
    sum_rmse = 0
    length = 0
    for (train, test) in strategy.split(X, y):
        # open question: do we need to normalize the input? probably not

        # fit the model on the training fold
        reg = model.fit(X[train], y[train])
        y_pred = reg.predict(X[test])

        # compute the RMSE on the held-out fold
        rmse = np.sqrt(skm.mean_squared_error(y[test], y_pred))
        sum_rmse += rmse
        length += 1
    return sum_rmse / length


In [None]:
from sklearn.model_selection import KFold

#strategy for model selection
kf5 = KFold(n_splits = 5)
kf10 = KFold(n_splits = 10)

In [None]:
from sklearn import linear_model
lr = linear_model.LinearRegression(fit_intercept=True, copy_X=True, n_jobs=None, positive=False)

In [None]:
rmse_vector = dict()
for key in count_vector:
    rmse = test_model(kf10, lr, count_vector[key], df['score_compound'])
    rmse_vector[key] = rmse
rmse_vector

In [None]:
# This is just an example plot; it is not used in the final analysis.
# make data
x = list(rmse_vector.keys())
y = list(rmse_vector.values())

# set the size when creating the figure so that it takes effect
fig, ax = plt.subplots(figsize=(100, 30))
positions = list(range(len(x)))
markerline, stemlines, baseline = ax.stem(positions, y)
ax.set_xticks(positions)
ax.set_xticklabels(x, rotation=90)
plt.setp(stemlines, 'linewidth', 30)
plt.savefig('res.jpg')
plt.show()

In [None]:
print("TOP 5 RMSE \n")
for e in sorted(rmse_vector, key=rmse_vector.get)[:5]:
    print(e + " = " + str(rmse_vector[e]))

Next, we try TF-IDF weighting on top of these count vectors.
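
For reference, the weighting applied by `TfidfTransformer` with its default settings can be sketched in a few lines of NumPy: a smoothed inverse document frequency, `idf = ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each row. The function name and the toy count matrix below are illustrative assumptions.

```python
import numpy as np

def tfidf(counts):
    """TF-IDF with smoothed idf and L2 row normalisation
    (the default behaviour of scikit-learn's TfidfTransformer)."""
    counts = np.asarray(counts, dtype=float)
    n_docs = counts.shape[0]
    df = np.count_nonzero(counts, axis=0)        # documents containing each term
    idf = np.log((1 + n_docs) / (1 + df)) + 1    # smoothed inverse document frequency
    weighted = counts * idf
    norms = np.linalg.norm(weighted, axis=1, keepdims=True)
    norms[norms == 0] = 1.0                      # leave empty rows as zeros
    return weighted / norms

# Rows are documents, columns are terms; term 0 appears in every document.
counts = [[1, 1, 0],
          [1, 0, 2],
          [1, 0, 0]]
weights = tfidf(counts)
print(np.round(weights, 3))
```

The term that occurs in every document gets the lowest weight relative to rarer terms in the same row, which is the intended effect of the reweighting.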

In [None]:
from sklearn.feature_extraction.text import TfidfTransformer as tfidf

In [None]:
tfidf_vector = dict()

def save_pickle(file_name):
    with open('tfidf/' + file_name, 'wb') as f:
        pickle.dump(tfidf_vector[file_name], f)
        
def load_pickle(file_name):
    with open('tfidf/' + file_name, 'rb') as f:
        tfidf_vector[file_name] = pickle.load(f)

for key in count_vector.keys():
    try:
        load_pickle(key)
    except:
        tfidf_vector[key] = tfidf().fit_transform(count_vector[key])
        save_pickle(key)

In [None]:
tfidf_vector

In [None]:
rmse_tfidf_vector = dict()
for key in tfidf_vector:
    rmse = test_model(kf10, lr, tfidf_vector[key], df['score_compound'])
    rmse_tfidf_vector[key] = rmse
rmse_tfidf_vector

In [None]:
print("TOP 5 RMSE \n")
for e in sorted(rmse_tfidf_vector, key=rmse_tfidf_vector.get)[:5]:
    print(e + " = " + str(rmse_tfidf_vector[e]))

# Open problem: how to incorporate the additional features so that they are weighted well?

In [None]:
dfa = df_norm.drop(['score_compound'], axis = 1).to_numpy()

In [None]:
count_vector['words11g1000.pkl'].toarray()

In [None]:
dfa = np.append(dfa, count_vector['words12g75000.pkl'].toarray(), axis = 1)

In [None]:
dfa.shape

In [None]:

test_model(kf10, lr, dfa, df['score_compound'])

In [None]:

test_model(kf10, lr, sentence_embeddings, df['score_compound'])

In [None]:

sgd = linear_model.SGDRegressor(alpha = 0.001,
 epsilon= 0.5,
 learning_rate= 'adaptive',
 loss= 'squared_error',
 penalty= 'l2',
 tol= 0.001)

In [None]:

test_model(kf10, sgd, sentence_embeddings, df['score_compound'])

In [None]:

test_model(kf10, sgd, tfidf_vector['words12g75000.pkl'], df['score_compound'])

If we have time, we will try all of them.
For now, we pick the TF-IDF version of `words12g75000`.

# Trying to test with many regression models

We will use the top 5 TF-IDF and bag-of-words representations for testing the models.

Another problem to note is the class imbalance of the data. We could try to balance it first, either by generating more data points or by pruning existing ones.

If I recall correctly, SMOTE can do this, but it resamples according to categorical labels.
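
Because the resampling is driven by the categorical `sentiment` label while the regression target is continuous, the features and the target have to be resampled together. The sketch below shows the idea with plain random oversampling in NumPy; it is a simplified stand-in for SMOTE (it duplicates rows instead of interpolating new ones), and the function name and toy data are assumptions.

```python
import numpy as np

def oversample_by_class(X, y, labels, seed=0):
    """Randomly duplicate rows of the smaller classes until every class is as
    large as the largest one. X, y and labels are indexed with the same row
    indices, so each feature row keeps its own regression target."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(labels, return_counts=True)
    target = counts.max()
    chosen = []
    for cls, n in zip(classes, counts):
        members = np.flatnonzero(labels == cls)
        extra = rng.choice(members, size=target - n, replace=True)
        chosen.append(np.concatenate([members, extra]))
    idx = np.concatenate(chosen)
    return X[idx], y[idx], labels[idx]

# Toy data: the target is a known function of the first feature,
# which lets us check that pairs stay aligned after resampling.
X = np.arange(12, dtype=float).reshape(6, 2)
y = 0.1 * X[:, 0]
labels = np.array(["pos", "pos", "pos", "pos", "neg", "neu"])

X_bal, y_bal, labels_bal = oversample_by_class(X, y, labels)
print(dict(zip(*np.unique(labels_bal, return_counts=True))))
```

Resampling should be applied to the training split only, so that the validation scores still reflect the original class distribution.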

In [None]:
from sklearn.linear_model import LinearRegression as LR
from sklearn.linear_model import Lasso, Perceptron
from sklearn.linear_model import BayesianRidge as BR
from sklearn.linear_model import SGDRegressor as SGD
from sklearn.svm import SVR
from sklearn.neighbors import RadiusNeighborsRegressor as RNR, KNeighborsRegressor as KNR
from sklearn.tree import DecisionTreeRegressor as DTR
import xgboost

In [None]:
# model

model = {
    'lr' : LR(n_jobs = -1),
    'lasso' : Lasso(),
    'sgd': SGD(),
#     'br' : BR(),  # requires a dense matrix, skipped for now
    'svr' : SVR(),
    'knr' : KNR(n_jobs = -1),
#     'dtr' : DTR(),
    'xgb': xgboost.XGBRegressor(n_jobs = -1)
}

In [None]:
rmse_model_vector = dict()

for key in model.keys():
    rmse_score = dict()
    for e in sorted(rmse_vector, key=rmse_vector.get)[:3]:
        rmse = test_model(kf10, model[key], count_vector[e], df['score_compound'])
        rmse_score[e] = rmse
    rmse_model_vector[key] = rmse_score
rmse_model_vector

In [None]:
rmse_tfidf_model_vector = dict()
for key in model.keys():
    rmse_score = dict()
    for e in sorted(rmse_tfidf_vector, key=rmse_tfidf_vector.get)[:3]:
        rmse = test_model(kf10, model[key], tfidf_vector[e], df['score_compound'])
        rmse_score[e] = rmse
    rmse_tfidf_model_vector[key] = rmse_score
rmse_tfidf_model_vector

In [None]:
for key in model.keys():
    print(test_model(kf10, model[key], sentence_embeddings, df['score_compound']))

We will take XGBoost (`xgb`) and linear regression (`lr`) and try to improve their performance.

In [None]:
from sklearn.model_selection import GridSearchCV
import math

In [None]:
def rmse_func(y_true, y_pred):
    # root mean squared error over all samples
    return np.sqrt(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2))

In [None]:
xgb = xgboost.XGBRegressor(n_jobs = -1, verbosity = 3, eval_metric = 'rmse', tree_method = 'gpu_hist')

In [None]:
xgb_param = {
#     'n_estimators' : [750, 1000, 1250]
    'max_depth': [3, 5,6,7, 9]
}

In [None]:
# `scoring` expects a scorer name or scorer object, not a raw metric function
res = GridSearchCV(xgb, xgb_param, n_jobs = -1, cv = 10, error_score = 'raise', scoring = 'neg_root_mean_squared_error')

In [None]:
res.fit(count_vector['words12g.pkl'], df['score_compound'])

In [None]:
res.best_params_

In [None]:
res.best_score_

# Linear regression

In this part, we will solve a linear regression task to predict our target `score_compound`, i.e. the continuous sentiment score of tweets, using our features, which are encodings of the tweets.


In [None]:
#define some functions for plotting purposes

def plot_y_continous(y, bins=10, show=True, title=None):
    fig, ax = plt.subplots(1, 1)
    _ = ax.hist(y, bins=bins)
    if isinstance(title, str):
        ax.set_title(title)
    plt.tight_layout()
    if show: plt.show()

def plot_scatter(x, y,  show=True, x_label=None, y_label=None,  title=None):
    fig, ax = plt.subplots(1, 1)
    _ = ax.scatter(x,y)
    if isinstance(title, str):
        ax.set_title(title)
    if isinstance(x_label, str):
        ax.set_xlabel(x_label)
    if isinstance(y_label, str):
        ax.set_ylabel(y_label)
    plt.tight_layout()
    if show: plt.show()
    

In [None]:
#create X (feature matrix) and y (targets)
X = sentence_embeddings
y = df.score_compound.values
print(f"X: {X.shape}")
print(f"y: {y.shape}")



In [None]:
z = df.sentiment.values

In [None]:
plt.close('all')
plot_y_continous(y, bins=20, title='Histogram of Target variable')

In its simplest form, predictions of a linear regression model can be summarized as

$$
\hat{y} = \mathbf{w}^T \mathbf{x} = f(\mathbf{x},\mathbf{w})
$$

where the weights are found by minimizing the mean squared error cost function

$$
\mathbf{w}^{*}=\underset{\mathbf{w}}{\arg \min } \frac{1}{n} \sum_{i=1}^{n}\left(y_{i}-f\left(\mathbf{x}_{i}, \mathbf{w}\right)\right)^{2}
$$
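
As a hedged illustration of this objective, the minimizer can be computed directly with NumPy's least-squares solver on synthetic data. The data and the "true" weights below are made up for the example, and no intercept is included, matching the formula above (scikit-learn's `LinearRegression` fits an intercept by default).

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data generated from known weights plus a little noise (assumption).
X = rng.normal(size=(200, 3))
w_true = np.array([0.5, -1.0, 2.0])
y = X @ w_true + 0.1 * rng.normal(size=200)

# w* = argmin_w (1/n) * sum_i (y_i - w^T x_i)^2
w_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

y_hat = X @ w_hat
mse = np.mean((y - y_hat) ** 2)
rmse = np.sqrt(mse)

print("estimated weights:", np.round(w_hat, 2))
print(f"RMSE: {rmse:.3f}")
```

The RMSE reported throughout this notebook is the square root of the same cost, which puts the error back on the scale of `score_compound`.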

In [None]:
from sklearn import linear_model, svm, tree, metrics
from sklearn.experimental import enable_halving_search_cv
from sklearn.model_selection import train_test_split, HalvingGridSearchCV
from sklearn.model_selection import KFold, StratifiedKFold
from sklearn.naive_bayes import ComplementNB
import sklearn.metrics as skm
import numpy as np

In [None]:
from imblearn.combine import SMOTEENN

# Using K-Fold strategy to achieve better generalization (5 and 10)

In [None]:
def test_model(strategy, model, X, y):
    sum_rmse = 0
    length = 0
    for (train, test) in strategy.split(X, y):
        reg = model.fit(X[train], y[train])
        y_pred = reg.predict(X[test])
        rmse = np.sqrt(skm.mean_squared_error(y[test], y_pred))
        sum_rmse += rmse
        length += 1
    print("\n strategy = ", strategy, "\n model = ", model, "\n avg_rmse = ", sum_rmse / length)


In [None]:
# question: should we normalize the data? for now, I do not think that we need it

In [None]:
#strategy for model selection
kf5 = KFold(n_splits = 5)
kf10 = KFold(n_splits = 10)

In [None]:
#model that we wanted to try

# 1. linear regression (example)
lr = linear_model.LinearRegression(fit_intercept=True, copy_X=True, n_jobs=None, positive=False)

# 1. Lasso
lasso = linear_model.Lasso()

# 2. SVM
svr = svm.SVR()  # named svr so that the imported svm module is not shadowed

# 3. SGD 
sgd = linear_model.SGDRegressor()

# 4. Decision tree
dt = tree.DecisionTreeRegressor()

# 5. neural network model


In [None]:
nb = ComplementNB()

In [None]:
nb_param = {
    'alpha': [1, 0.5, 0.1, 1e-2, 1e-3],
    'norm': [False, True]    
}

In [None]:
nb_gscv = HalvingGridSearchCV(nb, param_grid = nb_param, n_jobs = -1, cv = kf10, scoring = 'f1_macro')  # classifier, so use a classification scorer

In [None]:
nb_gscv.fit(X_train,z_train)

In [None]:
sgd_param = {
    'loss': ['squared_error', 'huber', 'epsilon_insensitive', 'squared_epsilon_insensitive'],
    'penalty': ['l2', 'l1', 'elasticnet'],
    'alpha': [1e-2, 1e-3, 1e-4, 1e-5],
    'tol': [ 1e-3, 1e-2, 1e-4],
    'epsilon': [0.5, 0.25, 0.1, 0.05, 0.01],
    'learning_rate': ['adaptive', 'optimal', 'invscaling'],
}

In [None]:
sgd_gscv = HalvingGridSearchCV(sgd, param_grid = sgd_param, n_jobs = -1, cv = kf10, scoring = 'neg_root_mean_squared_error')

In [None]:
sgd_gscv.fit(X,y)

In [None]:
sgd_gscv.best_params_

In [None]:
y_pred = sgd_gscv.predict(X_val)
rmse = np.sqrt(skm.mean_squared_error(y_val, y_pred))
print(rmse)

In [None]:
dt_param = {'max_depth':                 [2, 4, 8, 16, 32, 64, 128, 256],
            'min_samples_split' :        [2, 4, 8, 16, 32, 64],
            'min_samples_leaf' :         [2, 4, 8, 16, 32, 64],
            'min_weight_fraction_leaf' : [0, 1e-3, 1e-4, 1e-5],
            'min_impurity_decrease' :    [0, 1e-3, 1e-4, 1e-5]           
           }

In [None]:
reg_gscv = HalvingGridSearchCV(dt, param_grid = dt_param, n_jobs = -1, cv = kf5, scoring = 'neg_root_mean_squared_error')

In [None]:
reg_gscv.fit(X, y)

In [None]:
reg_gscv.best_params_

In [None]:
y_pred = reg_gscv.predict(X_val)
rmse = np.sqrt(skm.mean_squared_error(y_val, y_pred))
print(rmse)

In [None]:
#split X and y for training and validation purposes
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

# same random_state as above, so z_train / z_val line up with X_train / X_val
_, _, z_train, z_val = train_test_split(X, z, test_size=0.2, random_state=42)

datasets = [
    [X_train, y_train],
    [X_val, y_val]
]

#create our linear regression model
# reg = linear_model.LinearRegression(fit_intercept=True, copy_X=True, n_jobs=None, positive=False)

In [None]:
from imblearn.over_sampling import SMOTE, RandomOverSampler
from collections import Counter

In [None]:
ros = RandomOverSampler(random_state = 42)

In [None]:
X_res, z_res = ros.fit_resample(X_train, z_train)
# reshape(-1, 1) avoids hard-coding the number of training rows
y_res, z_res = ros.fit_resample(y_train.reshape(-1, 1), z_train)

In [None]:
# Resample features and target together so each synthetic row keeps a matching target
# TODO: continue here
Xy_res, z_res = SMOTE().fit_resample(np.column_stack([X_train, y_train]), z_train)
X_res, y_res = Xy_res[:, :-1], Xy_res[:, [-1]]

In [None]:
# Resample features and target together so each resampled row keeps a matching target
Xy_res, z_res = SMOTEENN().fit_resample(np.column_stack([X_train, y_train]), z_train)
X_res, y_res = Xy_res[:, :-1], Xy_res[:, [-1]]

In [None]:
y_res.reshape(-1)

In [None]:
# True if the resampled data contains duplicate rows
len(np.unique(X_res, axis=0)) != len(X_res)

In [None]:
z = df.sentiment.values
le = preprocessing.LabelEncoder()
le.fit(z)
z =le.transform(z)


In [None]:
#split X and y for training and validation purposes
#it is split with categorical features because we want to use SMOTE first


Now fit a linear regression model on the training data.

In [None]:
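# create the regressor before fitting; ravel gives the 1-D target that scikit-learn expects
reg = linear_model.LinearRegression().fit(X_res, np.ravel(y_res))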

In [None]:
# Evaluate our predictor quantitatively
for split_name, dataset in zip(['train', 'validation'], datasets):
    X_i, y_i = dataset
    y_pred = reg.predict(X_i)

    rmse = np.sqrt(skm.mean_squared_error(y_i, y_pred))
    print(f'\nSplit: {split_name}')
    print(f"\tRMSE: {rmse:.2f}")
    mae = skm.mean_absolute_error(y_i, y_pred)
    print(f"\tMAE: {mae:.2f}")

In [None]:
#plot the histogram of learnt weights w_i 
plot_y_continous(reg.coef_, bins=20, title='Histogram of Parameters (w) learnt')

At this point, we can use our model to predict the sentiment scores of tweets from `X_test`, i.e. the test set. Do not forget to encode them as well.

And save your predictions `y_hat` by naming it with the following format. 

`<TEAM_ID>__<SPLIT>_reg_pred.npy`

Make sure that

`<TEAM_ID>` is your team id as given in CMS.

`<SPLIT>` is "test_1" during the semester and "test_2" for final submission. You will be notified when we need to move to "test_2".

In [None]:
# Run this to save a file with your predictions on the test set to be submitted

sentences = list(df_test.words_str.values)
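# `model` was rebound to a dict of regressors earlier, so reload the sentence encoder here
X_test = SentenceTransformer(name).encode(sentences)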
y_hat = reg.predict(X_test)

# Save the results with the format <TEAM_ID>__<SPLIT>_reg_pred.npy

folder = 'result'
np.save(os.path.join(folder, f'{team_id}__{split}__reg_pred.npy'), y_hat)


# Linear classification

In this part, we will solve a linear classification task to predict our target `sentiment`, i.e. sentiment class of tweets, using our features which are encodings of the tweets.


In [None]:
from sklearn import linear_model
from sklearn import preprocessing
import numpy as np

In [None]:
def plot_y_discrete(y, show=True, title=None):
    fig, ax = plt.subplots(1, 1)
    sns.countplot(x=y, palette=colors, ax=ax)
    if isinstance(title, str):
        ax.set_title(title)
    plt.tight_layout()
    if show: plt.show()

In [None]:
plot_y_discrete(df.sentiment)

We will first change our targets (classes; positive, neutral, negative) to numeric targets. Then, we solve a logistic regression problem by minimizing the multinomial cross-entropy function

$$
J(\theta) = -\frac{1}{n} \sum_{i=1}^{n} \sum_{k=1}^{K} \mathbb{1}_{y_{i}=k} \log(p_{\theta}(\hat{y}=k | \mathbf{x}_{i}))
$$

where $y_i \in \{1,\ldots,K\}$ and $p_{\theta}(\hat{y}=k | \mathbf{x}_{i})$ is the probability assigned by our model to class $k$ having observed features $\mathbf{x}_{i}$.
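
The loss above can be evaluated by hand on a few examples. The NumPy sketch below is purely illustrative: the logits are made-up numbers, the helper names are hypothetical, and classes are indexed 0 to K-1 (as `LabelEncoder` produces them) rather than 1 to K.

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=1, keepdims=True)

def cross_entropy(probs, y):
    """Mean negative log-probability assigned to the true class.
    `y` holds integer class indices in 0..K-1."""
    n = len(y)
    return float(-np.mean(np.log(probs[np.arange(n), y])))

# Three examples, K = 3 classes (e.g. negative, neutral, positive).
logits = np.array([[2.0, 0.0, -1.0],
                   [0.0, 0.0, 0.0],
                   [-1.0, 0.5, 3.0]])
y = np.array([0, 1, 2])

probs = softmax(logits)
loss = cross_entropy(probs, y)
uniform_loss = cross_entropy(np.full((3, 3), 1 / 3), y)  # equals ln(3)

print(f"loss: {loss:.3f}, uniform baseline: {uniform_loss:.3f}")
```

A model that always predicts equal probabilities scores ln(K), so any useful classifier should end up below that baseline.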

In [None]:
X = sentence_embeddings
y_text = df.sentiment.values
le = preprocessing.LabelEncoder()
le.fit(y_text)
print(f'Original classes {le.classes_}')
print(f'Corresponding numeric classes {le.transform(le.classes_)}')
y =le.transform(y_text)
print(f"X: {X.shape}")
print(f"y: {y.shape} {np.unique(y)}")
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

datasets = [
    [X_train, y_train],
    [X_val, y_val]
]
clf = linear_model.LogisticRegression(penalty='none', 
                                      dual=False, 
                                      tol=0.0001, 
                                      C=1.0, 
                                      fit_intercept=True, 
                                      intercept_scaling=1, 
                                      class_weight=None, # None, balanced
                                      random_state=None, 
                                      solver='lbfgs', 
                                      max_iter=1000, 
                                      multi_class='auto', 
                                      verbose=0, 
                                      warm_start=False, 
                                      n_jobs=None, 
                                      l1_ratio=None
                                      
                                     )

# MODEL

In [None]:
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

In [None]:
svc = SVC(class_weight = 'balanced', kernel = 'poly', coef0 = 1e-2)


In [None]:
test_model(kf10, svc, X, y)

In [None]:
test_model(kf10, svc, tfidf_vector['words12g75000.pkl'], y)

In [None]:
svc_param = {
#     'C': [1.0, 0.5, 0.1],
    'kernel': ['poly'],
#     'degree': [3],
#     'gamma': ['scale', 'auto'],
    'coef0': [0, 0.1, 1e-2, 1e-3],
#     'tol': [1e-2, 1e-3, 1e-4],
    
}

In [None]:
# svc = svc.fit(X_res, y_res)
# `scoring` needs a scorer name; the macro-averaged F1 handles the three classes
svc_gscv = GridSearchCV(svc, param_grid = svc_param, n_jobs = -1, cv = kf5, scoring = 'f1_macro', verbose = 3)

In [None]:
svc_gscv = svc_gscv.fit(X_res, y_res)

In [None]:
svc_gscv.best_params_

In [None]:
X_res, y_res = SMOTE().fit_resample(X_train, y_train)

In [None]:
Counter(y_res)

In [None]:
datasets = [
    [X_res, y_res],
    [X_val, y_val]
]

In [None]:

print(f"X: {X_res.shape}")
print(f"y: {y_res.shape} {np.unique(y_res)}")

Fit your model by using training data.

In [None]:
clf = clf.fit(X_res, y_res)


In [None]:
y_pred = svc_gscv.predict(X_val)
rmse = np.sqrt(skm.mean_squared_error(y_val, y_pred))
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                
print(rmse)


Reminders about macro and micro averaging:


In the context of computing F1-score, "macro" and "micro" averaging are two commonly used techniques to aggregate the per-class F1-scores.

**Micro-average**: Compute the F1-score globally by counting the total true positives, false negatives, and false positives over all classes, and then calculating precision, recall, and F1-score using these aggregated values.

**Macro-average**: Calculate the F1-score for each class separately, and then take the average of these per-class F1-scores.

The main difference between these two techniques is the way they treat class imbalance. Micro-average treats every sample equally, so classes with more samples contribute more to the score, while macro-average treats every class equally, regardless of the number of samples in that class.

Micro-average is often used when we care about overall performance across all classes, and we want to give more weight to the performance on larger classes. In contrast, macro-average is often used when we want to evaluate the performance on each class separately and give equal weight to each class.
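
As a minimal sketch of the two definitions above, the cell below computes both averages by hand on a small set of made-up labels (the labels and the helper names `f1_from_counts` and `count_errors` are illustrative and not part of the challenge code). It uses the identity F1 = 2·TP / (2·TP + FP + FN) and assumes a class with no true positives gets an F1-score of 0.

```python
from collections import Counter

def f1_from_counts(tp, fp, fn):
    """F1 = 2*TP / (2*TP + FP + FN); taken as 0.0 when the denominator is 0."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def count_errors(y_true, y_pred):
    """Return per-class counters of true positives, false positives and false negatives."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # class p was predicted, but wrongly
            fn[t] += 1  # the true class t was missed
    return tp, fp, fn

# Toy, made-up labels: 6 positive, 3 neutral, 1 negative
y_true = ['positive'] * 6 + ['neutral'] * 3 + ['negative']
y_pred = ['positive'] * 6 + ['neutral', 'neutral', 'positive'] + ['positive']

classes = sorted(set(y_true) | set(y_pred))
tp, fp, fn = count_errors(y_true, y_pred)

# Macro: one F1-score per class, then a plain average over classes
per_class_f1 = {c: f1_from_counts(tp[c], fp[c], fn[c]) for c in classes}
macro_f1 = sum(per_class_f1.values()) / len(classes)

# Micro: pool the counts over all classes, then compute a single F1-score
micro_f1 = f1_from_counts(sum(tp.values()), sum(fp.values()), sum(fn.values()))

print(per_class_f1)
print(f'macro F1: {macro_f1:.3f}')
print(f'micro F1: {micro_f1:.3f}')
```

On this toy data the micro-average is 0.80, which equals the accuracy (8 of 10 correct), while the macro-average drops to about 0.55 because the single missed 'negative' sample counts as a full class. In scikit-learn the same quantities are available through `skm.f1_score(y_true, y_pred, average='micro')` and `average='macro'`.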


In addition to micro and macro averaging, there is another common technique for computing the F1-score called **weighted averaging**.

**Weighted averaging** is similar to macro averaging in that it computes the per-class F1-score and then takes the average of these scores. However, unlike macro averaging, weighted averaging takes into account the number of samples in each class when computing the average. Specifically, the weighted average is computed as follows:

- Compute the F1-score for each class separately.
- Compute the weight for each class as the number of samples in that class divided by the total number of samples.
- Compute the weighted average of the per-class F1-scores, where each per-class F1-score is weighted by the weight of that class.

The weighted average is commonly reported when the dataset is imbalanced, meaning that some classes have many more samples than others. In such cases, the simple average (macro-average) lets a small class influence the score as much as a large one, while micro-average pools all samples and is therefore dominated by the larger classes. The weighted average still computes one F1-score per class, as macro-averaging does, but weights each score by its class frequency, so larger classes count for more.
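
The three steps listed above can be sketched as follows. The per-class F1-scores and class sizes are hypothetical numbers chosen only for illustration, not results from this dataset.

```python
# Hypothetical per-class F1-scores and class sizes (number of true samples per class)
per_class_f1 = {'negative': 0.0, 'neutral': 0.8, 'positive': 0.9}
support = {'negative': 10, 'neutral': 30, 'positive': 60}

# Weight of each class = its share of the samples
total = sum(support.values())
weights = {c: n / total for c, n in support.items()}

# Weighted average: each per-class F1-score is scaled by its class weight
weighted_f1 = sum(weights[c] * per_class_f1[c] for c in per_class_f1)

# Macro average for comparison: every class has the same weight
macro_f1 = sum(per_class_f1.values()) / len(per_class_f1)

print(f'weighted F1: {weighted_f1:.3f}')
print(f'macro F1:    {macro_f1:.3f}')
```

Here the weighted average (0.78) is much higher than the macro average (about 0.57), because the poorly predicted 'negative' class holds only 10% of the samples. In scikit-learn, `skm.f1_score(y_true, y_pred, average='weighted')` computes this average from the labels directly, and `skm.classification_report` prints it in the "weighted avg" row.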


Now evaluate your model

In [None]:
for split_name, dataset in zip(['train', 'validation'], datasets):
    X_i, y_i = dataset
    y_pred = svc.predict(X_i)
    print(f'\nSplit: {split_name}')
    print(skm.classification_report(y_i, y_pred))

At this point, we can use our model to predict the sentiment of the tweets in the test set, `X_test`. Do not forget to encode them as well.

Save your predictions `y_hat` in a file named with the following format:

`<TEAM_ID>__<SPLIT>__clf_pred.npy`

Make sure that

`<TEAM_ID>` is your team id as given in CMS.

`<SPLIT>` is "test_1" during the semester and "test_2" for final submission. You will be notified when we need to move to "test_2".

In [None]:
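# Run this to save a file with your predictions on the test set to be submitted
sentences = list(df_test.words_str.values)
X_test = model.encode(sentences)  # encode the test tweets with the sentence encoder
y_hat = clf.predict(X_test)  # `clf` is your trained classifier

# Save the results with the format <TEAM_ID>__<SPLIT>__clf_pred.npy
# (assumes `team_id` and `split` were defined earlier in the notebook)
folder = 'result'
os.makedirs(folder, exist_ok=True)  # create the output folder if it does not exist
np.save(os.path.join(folder, f'{team_id}__{split}__clf_pred.npy'), y_hat)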

# Submission to CMS

Put your .npy files for both the regression and classification tasks in the same zip file. Name the file `<TEAM_ID>.zip` and upload it to the CMS system. It is essential that the files inside the .zip are named as follows:

`<TEAM_ID>__<SPLIT>__reg_pred.npy` \
`<TEAM_ID>__<SPLIT>__clf_pred.npy`

Above, `<SPLIT>` should correspond to `test_1` for the leaderboard and `test_2` for the final submission. 
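
One possible way to build the zip file from Python is sketched below, using the standard `zipfile` module. The helper `make_submission_zip` is hypothetical and not part of the course material. It assumes that both prediction files are already saved in `folder`, that the archive should be written to `out_dir`, and that the files should sit at the root of the archive; check the CMS instructions if in doubt.

```python
import os
import zipfile

def make_submission_zip(team_id, split, folder='result', out_dir='.'):
    """Bundle the regression and classification predictions into <TEAM_ID>.zip."""
    names = [f'{team_id}__{split}__{task}_pred.npy' for task in ('reg', 'clf')]

    # Check that both files exist before creating the archive
    for name in names:
        path = os.path.join(folder, name)
        if not os.path.isfile(path):
            raise FileNotFoundError(f'Missing prediction file: {path}')

    zip_path = os.path.join(out_dir, f'{team_id}.zip')
    with zipfile.ZipFile(zip_path, 'w', zipfile.ZIP_DEFLATED) as zf:
        for name in names:
            # arcname keeps only the file name, so the files sit at the zip root
            zf.write(os.path.join(folder, name), arcname=name)
    return zip_path
```

With `team_id` and `split` defined as in the cells above, calling `make_submission_zip(team_id, split)` would write `<TEAM_ID>.zip` to the current working directory.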

In [None]:
res = np.load('result/1__test_1__reg_pred.npy')

In [None]:
res