# ML4NLP1
## Starting Point for Exercise 1, part I

This notebook serves as a starting point and inspiration for exercise 1, part I.

One of the goals of this exercise is to make you acquainted with sklearn and related libraries like pandas and numpy. You will probably need to consult the documentation of those libraries:
- sklearn: [Documentation](https://scikit-learn.org/stable/user_guide.html)
- Pandas: [Documentation](https://pandas.pydata.org/docs/#)
- Numpy: [Documentation](https://numpy.org/doc/)

**Importing files to Google Colab:** If you have never used Colab or never uploaded a file to Colab, quickly skim over an introduction: [Introduction on medium](https://medium.com/@master_yi/importing-datasets-in-google-colab-c816fc654f97).

We're using the second method mentioned in the blog post: (1) upload the four files `x_train.txt`, `y_train.txt`, `x_test.txt` and `y_test.txt` to a directory in Google Drive and (2) adjust the paths in the second cell to point to your uploaded files.

Then execute the first cell to give Colab permission to access the uploaded files.

In [None]:
import pandas as pd
import numpy as np
import csv
import re
import string
from collections import defaultdict

In [None]:
# download dataset
!gdown 1QP6YuwdKFNUPpvhOaAcvv2Pcp4JMbIRs # x_train
!gdown 1QVo7PZAdiZKzifK8kwhEr_umosiDCUx6 # x_test
!gdown 1QbBeKcmG2ZyAEFB3AKGTgSWQ1YEMn2jl # y_train
!gdown 1QaZj6bI7_78ymnN8IpSk4gVvg-C9fA6X # y_test

Downloading...
From: https://drive.google.com/uc?id=1QP6YuwdKFNUPpvhOaAcvv2Pcp4JMbIRs
To: /content/x_train.txt
100% 64.1M/64.1M [00:00<00:00, 260MB/s]
Downloading...
From: https://drive.google.com/uc?id=1QVo7PZAdiZKzifK8kwhEr_umosiDCUx6
To: /content/x_test.txt
100% 65.2M/65.2M [00:00<00:00, 164MB/s]
Downloading...
From: https://drive.google.com/uc?id=1QbBeKcmG2ZyAEFB3AKGTgSWQ1YEMn2jl
To: /content/y_train.txt
100% 480k/480k [00:00<00:00, 96.2MB/s]
Downloading...
From: https://drive.google.com/uc?id=1QaZj6bI7_78ymnN8IpSk4gVvg-C9fA6X
To: /content/y_test.txt
100% 480k/480k [00:00<00:00, 89.2MB/s]


In [None]:
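# Read one instance per line; the texts are multilingual, so read them as UTF-8
with open('x_train.txt', encoding='utf-8') as f:
    x_train = f.read().splitlines()
with open('y_train.txt', encoding='utf-8') as f:
    y_train = f.read().splitlines()
with open('x_test.txt', encoding='utf-8') as f:
    x_test = f.read().splitlines()
with open('y_test.txt', encoding='utf-8') as f:
    y_test = f.read().splitlines()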

In [None]:
# combine x_train and y_train into one dataframe
train_df = pd.DataFrame({'text': x_train, 'label': y_train})
# write train_df to csv with tab as separator
train_df.to_csv('train_df.csv', index=False, sep='\t')
# combine x_test and y_test into one dataframe
test_df = pd.DataFrame({'text': x_test, 'label': y_test})

In [None]:
train_df.head()

| | text | label |
|---|---|---|
| 0 | Klement Gottwaldi surnukeha palsameeriti ning ... | est |
| 1 | Sebes, Joseph; Pereira Thomas (1961) (på eng).... | swe |
| 2 | भारतीय स्वातन्त्र्य आन्दोलन राष्ट्रीय एवम क्षे... | mai |
| 3 | Après lo cort periòde d'establiment a Basilèa,... | oci |
| 4 | ถนนเจริญกรุง (อักษรโรมัน: Thanon Charoen Krung... | tha |


In [None]:
# get list of all labels
labels = train_df['label'].unique().tolist()
print(labels)

['est', 'swe', 'mai', 'oci', 'tha', 'orm', 'lim', 'guj', 'pnb', 'zea', 'krc', 'hat', 'pcd', 'tam', 'vie', 'pan', 'szl', 'ckb', 'fur', 'wuu', 'arz', 'ton', 'eus', 'map-bms', 'glk', 'nld', 'bod', 'jpn', 'arg', 'srd', 'ext', 'sin', 'kur', 'che', 'tuk', 'pag', 'tur', 'als', 'koi', 'lat', 'urd', 'tat', 'bxr', 'ind', 'kir', 'zh-yue', 'dan', 'por', 'fra', 'ori', 'nob', 'jbo', 'kok', 'amh', 'khm', 'hbs', 'slv', 'bos', 'tet', 'zho', 'kor', 'sah', 'rup', 'ast', 'wol', 'bul', 'gla', 'msa', 'crh', 'lug', 'sun', 'bre', 'mon', 'nep', 'ibo', 'cdo', 'asm', 'grn', 'hin', 'mar', 'lin', 'ile', 'lmo', 'mya', 'ilo', 'csb', 'tyv', 'gle', 'nan', 'jam', 'scn', 'be-tarask', 'diq', 'cor', 'fao', 'mlg', 'yid', 'sme', 'spa', 'kbd', 'udm', 'isl', 'ksh', 'san', 'aze', 'nap', 'dsb', 'pam', 'cym', 'srp', 'stq', 'tel', 'swa', 'vls', 'mzn', 'bel', 'lad', 'ina', 'ava', 'lao', 'min', 'ita', 'nds-nl', 'oss', 'kab', 'pus', 'fin', 'snd', 'kaa', 'fas', 'cbk', 'cat', 'nci', 'mhr', 'roa-tara', 'frp', 'ron', 'new', 'bar', 'ltg'

**Question 1:**

In [None]:
# T: Have a quick peek at the training data, looking at a couple of texts from different languages. Do you notice anything that might be challenging for the classification?

Answer:

Some sentences contain words or phrases from more than one language, which makes it hard to determine the predominant language. Furthermore, many languages share the same character set or script, so character-level features alone may not separate them; Spanish and Italian, for example, are difficult to distinguish this way. Finally, closely related languages such as Swedish and Norwegian differ only slightly in vocabulary and context, which makes them hard to classify.
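To make the point about shared scripts concrete, here is a minimal sketch that guesses the dominant script of a text from Unicode character names. The helper `dominant_script` is hypothetical (it is not part of the exercise), and the sample texts are shortened rows from the `train_df.head()` output above.

```python
import unicodedata
from collections import Counter

def dominant_script(text):
    """Return the most common script prefix among the alphabetic characters."""
    counts = Counter()
    for ch in text:
        if ch.isalpha():
            # The first word of a Unicode character name is usually its script,
            # e.g. 'LATIN SMALL LETTER A' -> 'LATIN'.
            name = unicodedata.name(ch, "UNKNOWN")
            counts[name.split()[0]] += 1
    return counts.most_common(1)[0][0] if counts else None

samples = {
    "swe": "Sebes, Joseph; Pereira Thomas (1961) (på eng)",
    "tha": "ถนนเจริญกรุง",
    "oci": "Après lo cort periòde d'establiment a Basilèa",
}
scripts = {label: dominant_script(text) for label, text in samples.items()}
print(scripts)
```

Swedish and Occitan both come out as Latin script, so the script alone cannot separate them, whereas Thai is identifiable from its script.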

**Question 2:**

In [None]:
# T: How many instances per label are there in the training and test set? Do you think this is a balanced dataset? Do you think the train/test split is appropriate? If not, please rearrange the data in a more appropriate way.

In [None]:
lbl_instances = []

for lbl in labels:
  instances = []
  train_instances = train_df[train_df.label == lbl]['text'].values.tolist()
  test_instances = test_df[test_df.label == lbl]['text'].values.tolist()
  train_count = len(train_instances)
  test_count = len(test_instances)
  same_flag = 0 # same_flag: 1 if the label has the same number of instances in the train and test sets
  if train_count == test_count:
    same_flag = 1
  instances.append(lbl)
  instances.append(train_count)
  instances.append(test_count)
  instances.append(same_flag)
  lbl_instances.append(instances)

import pandas as pd
temp_df = pd.DataFrame(lbl_instances, columns=['Label', 'TrainSet_Count', 'TestSet_Count', 'Same?'])
temp_df

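| | Label | TrainSet_Count | TestSet_Count | Same? |
|---|---|---|---|---|
| 0 | est | 500 | 500 | 1 |
| 1 | swe | 500 | 500 | 1 |
| 2 | mai | 500 | 500 | 1 |
| 3 | oci | 500 | 500 | 1 |
| 4 | tha | 500 | 500 | 1 |
| ... | ... | ... | ... | ... |
| 230 | ell | 500 | 500 | 1 |
| 231 | lij | 500 | 500 | 1 |
| 232 | hau | 500 | 500 | 1 |
| 233 | mkd | 500 | 500 | 1 |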


In [None]:
print(f"Number of labels whose instance counts differ between the train and test sets: {len(temp_df[temp_df['Same?'] == 0])}")

Number of labels whose instance counts differ between the train and test sets: 0


In [None]:
from sklearn.model_selection import train_test_split
import pandas as pd

def rearrange_split(train_df, test_df):
  combined_df = pd.concat([train_df, test_df], axis=0)
  X = combined_df.drop(columns=['label'])  # Assuming 'label' is the column containing the labels
  y = combined_df['label']
  X_train, X_test, Y_train, Y_test = train_test_split(X, y, test_size=0.2) #80% training, 20% testing
  return X_train['text'], X_test['text'], Y_train, Y_test

Answer:

For each label there are 500 instances in both the training set and the test set, so the dataset is balanced, with the same number of instances per label in both sets. However, we think this 50/50 train/test split is inappropriate, so we rearrange the data into an 80% training and 20% test split. The larger training set can help improve model performance and generalization.
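A plain random 80/20 split does not guarantee that every label keeps the same share in both parts. As a possible refinement, the sketch below passes `stratify` and `random_state` to `train_test_split`. The toy `combined_df` and the seed value are assumptions made for illustration; this is not the split used for the results reported in this notebook.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the combined train and test data (hypothetical values).
combined_df = pd.DataFrame({
    "text": [f"sample {i}" for i in range(40)],
    "label": ["est"] * 20 + ["swe"] * 20,
})

X_train, X_test, Y_train, Y_test = train_test_split(
    combined_df["text"],
    combined_df["label"],
    test_size=0.2,
    stratify=combined_df["label"],  # keep the label proportions in both parts
    random_state=42,                # fixed seed so the split is repeatable
)

print(Y_train.value_counts().to_dict())
print(Y_test.value_counts().to_dict())
```

With stratification each label contributes exactly 80% of its instances to training and 20% to testing, so the balance of the original data is preserved.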

**Question 3:**

In [None]:
# T: Get a subset of the train/test data that includes English, German, Dutch, Danish, Swedish and Norwegian, plus 20 additional languages of your choice (the labels can be found in the file labels.csv)

In [None]:
included_languages = ['eng', 'deu', 'nld', 'dan', 'swe', 'nno',
                      'ara', 'hak', 'lzh', 'tha', 'fra',
                      'mal', 'zh-yue', 'zho', 'bar', 'tam',
                      'tur', 'ukr', 'vie', 'cdo', 'ces',
                      'div', 'ell', 'jpn', 'kor', 'lad']

In [None]:
# Filter train_df and test_df to include only specified languages
train_df = train_df[train_df['label'].isin(included_languages)]
test_df = test_df[test_df['label'].isin(included_languages)]

X_train, X_test, Y_train, Y_test = rearrange_split(train_df, test_df)

**Question 4:**

In [None]:
# T: With the following code, we wanted to encode the labels, however, our cat was walking on the keyboard and some of it got changed. Can you fix it?
from sklearn.preprocessing import LabelEncoder
le_fitted = LabelEncoder()
Y_train = le_fitted.fit_transform(Y_train)
Y_test = le_fitted.transform(Y_test)

**Part 2.**

In [None]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer, HashingVectorizer, TfidfVectorizer
from sklearn.preprocessing import FunctionTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

In [None]:
tfidf_vectorizer = TfidfVectorizer(
    strip_accents='unicode',
    analyzer='char',
    ngram_range=(1, 4),  # n-gram range
    max_features=5000,  # number of features
)

count_vectorizer = CountVectorizer(ngram_range=(1, 4), min_df=5, max_df=0.8)

# Custom text pre-processing function: removes digits and a few special characters
def preprocess_text(text_series):
  # Remove digits
  text_series = text_series.apply(lambda text: re.sub(r'\d', '', text))
  # Remove special characters
  text_series = text_series.apply(lambda text: re.sub(r'[)(*&^%$#@!]', '', text))
  return text_series


**Part 3.**

In [None]:
# Create a Logistic Regression classifier
logistic_classifier = LogisticRegression(solver='lbfgs', multi_class='multinomial', penalty='l2', max_iter=1000)

In [None]:
def fit_transform_predict(X_train, Y_train, X_test, Y_test):
  # Create a pipeline with the vectorizer and classifier
  pipeline = Pipeline([
      ('preprocess_text', FunctionTransformer(func=preprocess_text, validate=False)),
      ('union', FeatureUnion(
          transformer_list=[
              ('tfidf', tfidf_vectorizer),
              ('count_vectorizer', count_vectorizer),
          ],
      )),
      ('classifier', logistic_classifier)
  ])

  # Train the model
  pipeline.fit(X_train, Y_train)

  # Predict on the test set
  Y_pred = pipeline.predict(X_test)

  return Y_pred

In [None]:
Y_pred = fit_transform_predict(X_train, Y_train, X_test, Y_test)
# Compute accuracy between Y_test and Y_pred
accuracy = accuracy_score(Y_test, Y_pred)
print(f'Accuracy: {accuracy}')

Accuracy: 0.9790384615384615


**Part 4.**

In [None]:
from sklearn.model_selection import GridSearchCV
# Create a pipeline
pipe = Pipeline([
      ('preprocess_text', FunctionTransformer(func=preprocess_text, validate=False)),
      ('union', FeatureUnion(
          transformer_list=[
              ('tfidf', tfidf_vectorizer),
              ('count_vectorizer', count_vectorizer),
          ],
      )),
      ('classifier', logistic_classifier)
  ])
# Create a parameter grid
# NOTE: 'newton-cg' and 'lbfgs' do not support the 'l1' penalty, so every
# combination with 'l1' fails to fit and is scored as nan.
parameter_grid = {
    'classifier__max_iter': [300, 500],
    'classifier__penalty': ['l1', 'l2'],
    'classifier__solver': ['newton-cg', 'lbfgs'],  # not tried: 'sag', 'saga', 'liblinear'
}
# Pass the grid to GridSearchCV
grid_search = GridSearchCV(pipe, parameter_grid, cv=3)
grid_search.fit(X_train, Y_train)
print(grid_search.best_estimator_)

12 fits failed out of a total of 24.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
6 fits failed with the following error:
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/sklearn/model_selection/_validation.py", line 686, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/usr/local/lib/python3.10/dist-packages/sklearn/pipeline.py", line 405, in fit
    self._final_estimator.fit(Xt, y, **fit_params_last_step)
  File "/usr/local/lib/python3.10/dist-packages/sklearn/linear_model/_logistic.py", line 1162, in fit
    solver = _check_solver(self.solver, self.penalty, self.dual)
  File "/usr/local/lib/python3.10/dist-packages/sklearn/linear_model/_logistic.py", line 54, in _check_so
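The 12 failed fits are the combinations that pair the `l1` penalty with `newton-cg` or `lbfgs` (2 solvers × 2 `max_iter` values × 3 folds), since neither solver supports `l1`. One way to avoid them, sketched below, is to pass a list of grids so that each solver is only combined with penalties it supports. The toy texts, the reduced pipeline, and the choice of `saga` for `l1` are assumptions for illustration and were not run on the exercise data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Toy data standing in for X_train / Y_train (hypothetical examples).
texts = ["the cat sat", "a dog ran", "the sun is hot", "we like tea",
         "der Hund lief", "die Katze sitzt", "wir trinken Tee", "die Sonne ist heiss"]
targets = [0, 0, 0, 0, 1, 1, 1, 1]

pipe = Pipeline([
    ("tfidf", TfidfVectorizer(analyzer="char", ngram_range=(1, 2))),
    ("classifier", LogisticRegression(max_iter=500)),
])

# Each dict is searched separately, so no unsupported pair is ever built.
param_grid = [
    {"classifier__solver": ["lbfgs", "newton-cg"], "classifier__penalty": ["l2"]},
    {"classifier__solver": ["saga"], "classifier__penalty": ["l1", "l2"]},
]

# error_score="raise" surfaces any failing fit instead of silently scoring nan.
grid_search = GridSearchCV(pipe, param_grid, cv=2, error_score="raise")
grid_search.fit(texts, targets)

n_candidates = len(grid_search.cv_results_["params"])
print(n_candidates, grid_search.best_params_)
```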

Pipeline(steps=[('preprocess_text',
                 FunctionTransformer(func=<function preprocess_text at 0x7905320d7a30>)),
                ('union',
                 FeatureUnion(transformer_list=[('tfidf',
                                                 TfidfVectorizer(analyzer='char',
                                                                 max_features=5000,
                                                                 ngram_range=(1,
                                                                              4),
                                                                 strip_accents='unicode')),
                                                ('count_vectorizer',
                                                 CountVectorizer(max_df=0.8,
                                                                 min_df=5,
                                                                 ngram_range=(1,
                                                                   

**Extras:**

In [None]:
best_pipeline = grid_search.best_estimator_

tfidf_vectorizer = best_pipeline.named_steps['union'].transformer_list[0][1]
logistic_model = best_pipeline.named_steps['classifier']

def extract_top_features(language):
    train_df_filtered = train_df[train_df['label'] == language]
    test_df_filtered = test_df[test_df['label'] == language]
    X_train, X_test, Y_train, Y_test = rearrange_split(train_df_filtered, test_df_filtered)
    # NOTE: this refits the vectorizer on a single language, so its feature
    # order no longer matches the columns the classifier was trained on.
    X_train_tfidf = tfidf_vectorizer.fit_transform(X_train)
    feature_names = tfidf_vectorizer.get_feature_names_out()
    # NOTE: coef_[0] is the weight row of the first encoded class only, whatever
    # `language` is, which is why the same values appear for every language below.
    coefs = logistic_model.coef_[0]
    features_with_coefs = list(zip(feature_names, coefs))
    features_with_coefs.sort(key=lambda x: abs(x[1]), reverse=True)
    return features_with_coefs[:10]

top_features_eng = extract_top_features('eng')
top_features_jpn = extract_top_features('jpn')
top_features_swe = extract_top_features('swe')
top_features_nno = extract_top_features('nno')
print(f"English top ten features:{top_features_eng}\n Japanese top ten features:{top_features_jpn} \n Swedish top ten features:{top_features_swe}\n Norwegian top ten features:{top_features_nno}")






English top ten features:[('onom', 2.371443093059724), ('ound', 1.5268017967667002), ('pen', 1.078812728911029), ('par', 0.91306932427157), ('ool', 0.8699666728379214), (' sti', 0.8460785522190952), ('owi', 0.8098307726397882), ('orme', 0.7058242624007021), (' sto', 0.6533328497142297), ('p w', 0.6106879743828785)]
 Japanese top ten features:[('合も', 2.371443093059724), ('場し', 1.5268017967667002), ('子供', 1.078812728911029), ('好', 0.91306932427157), ('向', 0.8699666728379214), ('km', 0.8460785522190952), ('外の', 0.8098307726397882), ('国', 0.7058242624007021), ('l', 0.6533328497142297), ('大学院', 0.6106879743828785)] 
 Swedish top ten features:[('olm', 2.371443093059724), ('orh', 1.5268017967667002), ('pa a', 1.078812728911029), ('over', 0.91306932427157), ('om r', 0.8699666728379214), (' sku', 0.8460785522190952), ('osa ', 0.8098307726397882), ('op', 0.7058242624007021), (' sl', 0.6533328497142297), ('ott', 0.6106879743828785)]
 Norwegian top ten features:[('om e', 2.371443093059724), ('orti

**Confusion Matrix**


In [None]:
best_pipeline.fit(X_train, Y_train)
Y_pred2=best_pipeline.predict(X_test)
Confusion_mat=confusion_matrix(Y_test, Y_pred2,normalize='all')
print(Confusion_mat)
print(np.trace(Confusion_mat))

[[0.04019231 0.         0.         0.         0.         0.
  0.         0.         0.00019231 0.         0.         0.
  0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.         0.         0.
  0.         0.        ]
 [0.         0.04115385 0.         0.         0.         0.00269231
  0.         0.         0.00076923 0.         0.         0.
  0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.         0.         0.00019231
  0.00019231 0.        ]
 [0.         0.         0.04346154 0.         0.         0.
  0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.         0.         0.
  0.00019231 0.00057692]
 [0.         0.         0.         0.03153846 0.         0.
  0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.         0.         0.
  0.     
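With `normalize='all'` every cell of the confusion matrix is a fraction of all test instances, so the trace printed above is the overall accuracy. The sketch below checks this on a small made-up example and also shows `normalize='true'`, whose diagonal gives the per-class recall. The arrays `y_true` and `y_pred` are invented for illustration.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Made-up labels: 6 of the 8 predictions are correct.
y_true = np.array([0, 0, 1, 1, 2, 2, 2, 2])
y_pred = np.array([0, 1, 1, 1, 2, 2, 0, 2])

cm_all = confusion_matrix(y_true, y_pred, normalize="all")    # cells sum to 1
cm_true = confusion_matrix(y_true, y_pred, normalize="true")  # rows sum to 1

overall_accuracy = np.trace(cm_all)   # sum of the diagonal
per_class_recall = np.diag(cm_true)   # share of each true class predicted correctly

print(overall_accuracy, accuracy_score(y_true, y_pred))
print(per_class_recall)
```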

In [None]:
# (To-do)

**Ablation study**

In [None]:
labels = np.unique(Y_test)
accuracy_dict = {}
top2_encoded_lang = []
for lbl in labels:
  iloc = np.where(Y_test==lbl)
  true_values = Y_test[iloc]
  pred_values = Y_pred[iloc]
  accu = (true_values == pred_values).sum() / len(true_values)
  accuracy_dict[lbl] = round(float(accu), 4)  # per-label accuracy, rounded to 4 decimals

# Sort the dictionary items by values in descending order
accuracy_dict = dict(sorted(accuracy_dict.items(), key=lambda item: item[1], reverse=True))

# Get the top 2 highest values and their keys
top_2_items = list(accuracy_dict.items())[:2]

print("Top 2 highest accuracy with their Encoded-Language:")
for key, value in top_2_items:
    top2_encoded_lang.append(key)
    print(f"Encoded-Language: {key}, Accuracy: {value}")


Top 2 highest accuracy with their Encoded-Language:
Encoded-Language: 6, Accuracy: 1.0
Encoded-Language: 20, Accuracy: 1.0
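The integers printed above are the codes assigned by the label encoder. To read them as language labels again, `LabelEncoder.inverse_transform` can be used; in this notebook that would presumably be `le_fitted.inverse_transform(top2_encoded_lang)`. The sketch below shows the round trip on a toy encoder with made-up input.

```python
from sklearn.preprocessing import LabelEncoder

# Toy encoder; the real notebook would use the already fitted `le_fitted`.
encoder = LabelEncoder()
encoded = encoder.fit_transform(["swe", "dan", "eng", "dan"])

# Classes are stored in sorted order, so 'dan' -> 0, 'eng' -> 1, 'swe' -> 2.
print(encoder.classes_.tolist())   # ['dan', 'eng', 'swe']
print(encoded.tolist())            # [2, 0, 1, 0]

decoded = encoder.inverse_transform([0, 2])
print(decoded.tolist())            # ['dan', 'swe']
```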


In [None]:
# First Encoded-Language
# Filter out Test Set
iloc = np.where(Y_test==top2_encoded_lang[0])
Y_test_sub1 = Y_test[iloc]
X_test_sub1 = X_test.to_numpy()[iloc]
# Filter out Train Set
iloc = np.where(Y_train==top2_encoded_lang[0])
Y_train_sub1 = Y_train[iloc]
X_train_sub1 = X_train.to_numpy()[iloc]

# Second Encoded-Language
# Filter out Test Set
iloc = np.where(Y_test==top2_encoded_lang[1])
Y_test_sub2 = Y_test[iloc]
X_test_sub2 = X_test.to_numpy()[iloc]
# Filter out Train Set
iloc = np.where(Y_train==top2_encoded_lang[1])
Y_train_sub2 = Y_train[iloc]
X_train_sub2 = X_train.to_numpy()[iloc]

# Combine two Encoded-Language Dataset
X_train_subset = pd.Series(np.concatenate((X_train_sub1, X_train_sub2), axis=None))
X_test_subset = pd.Series(np.concatenate((X_test_sub1, X_test_sub2), axis=None))
Y_train_subset = np.concatenate((Y_train_sub1, Y_train_sub2), axis=None)
Y_test_subset = np.concatenate((Y_test_sub1, Y_test_sub2), axis=None)

In [None]:
# Re-fit the best working model several times, each time reducing the number of characters per instance in the training set
# (1) All characters
# (2) 500 characters
# (3) 100 characters

# Define different character lengths
# character_lengths = [len(instance) for instance in X_train_subset]
character_lengths_to_test = ["MAX", 500, 100]

for char_length in character_lengths_to_test:
  # Create a version of training data with the specified character length
  if char_length == "MAX":
    X_train_temp = pd.Series([instance[:] for instance in X_train_subset])
  else:
    X_train_temp = pd.Series([instance[:char_length] for instance in X_train_subset])

  Y_pred_subset = fit_transform_predict(X_train_temp, Y_train_subset, X_test_subset, Y_test_subset)
  # Compute accuracy between Y_test and Y_pred
  accuracy = accuracy_score(Y_test_subset, Y_pred_subset)
  print(f'Accuracy for Char-{char_length}: {accuracy}')

Accuracy for Char-MAX: 1.0
Accuracy for Char-500: 1.0
Accuracy for Char-100: 1.0
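Note that the loop above shortens only the training texts, while the test texts keep their full length. The truncation itself can be written more compactly with the pandas string accessor, as in the sketch below; the helper `truncate` and the sample series are hypothetical.

```python
import pandas as pd

def truncate(series, max_chars=None):
    """Keep at most max_chars characters per text; None keeps everything."""
    return series.str[:max_chars]

texts = pd.Series(["abcdefghij", "xyz"])

print(truncate(texts, 3).tolist())   # ['abc', 'xyz']
print(truncate(texts).tolist())      # ['abcdefghij', 'xyz']
```

Because `None` as a slice bound means "no limit", the `"MAX"` case needs no separate branch; the same helper could also be applied to the test texts to study shorter inputs at prediction time.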
