# 1. Week 36

In [1]:
pip install transformers


Note: you may need to restart the kernel to use updated packages.


In [85]:
import pandas as pd
from collections import Counter

# classifier
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.utils.data import DataLoader  # batches the data (e.g. 64 samples at a time) and shuffles it so training isn't biased by sample order

In [None]:
# load huggingface machine translation model
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("facebook/nllb-200-distilled-600M")
model = AutoModelForSeq2SeqLM.from_pretrained("facebook/nllb-200-distilled-600M")

## 1.1 Summarize data statistics 
- Report size, word count, etc. for the training and validation data in Arabic (ar), Korean (ko) and Telugu (te)

### 1.1.2 Set up datasets
- df_train_ar
- df_train_ko
- df_train_te
- df_val_ar
- df_val_ko
- df_val_te

In [3]:
splits = {'train': 'train.parquet', 'validation': 'validation.parquet'}
df_train = pd.read_parquet("hf://datasets/coastalcph/tydi_xor_rc/" + splits["train"])
df_val = pd.read_parquet("hf://datasets/coastalcph/tydi_xor_rc/" + splits["validation"])


In [4]:
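# training set - splitting up the languages
# .copy() makes each subset an independent DataFrame, so adding columns later does not trigger SettingWithCopyWarning
df_train_ar = df_train[df_train['lang'] == 'ar'].copy()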
df_train_ar[:5]

Unnamed: 0,question,context,lang,answerable,answer_start,answer,answer_inlang
11213,متى تدخلت روسيا في الحرب الأهلية السورية؟,The Russian military intervention in the Syria...,ar,True,67,September 2015,
11214,متى حصلت هنغاريا على استقلالها من النمسا ؟,"By 1918, the economic situation had deteriorat...",ar,True,454,October 1918,
11215,متى تحالفت فرنسا و بريطانيا العظمى ضد ألمانيا ...,France and Britain declared war on Germany whe...,ar,True,81,1939,
11216,كم عدد ضحايا أول إعتداء إسرائيلي على مدينة غزة ؟,The 2014 Israel–Gaza conflict also known as Op...,ar,True,607,death of thousands of people,
11217,هل سلسلة هاري بوتر مخالفة لقوانين المسيحية ؟,"Religious debates over the ""Harry Potter"" seri...",ar,False,-1,no,


In [5]:
df_train_ko = df_train[df_train['lang'] == 'ko'].copy()
df_train_ko[:5]


Unnamed: 0,question,context,lang,answerable,answer_start,answer,answer_inlang
4792,30년 전쟁의 승자는 누구인가?,The conflict between France and Spain continue...,ko,True,21,France,
4793,엑스선은 누가 발견하였는가?,"X-rays make up X-radiation, a form of electrom...",ko,True,503,Wilhelm Röntgen,
4794,아테네에서 언제 가장 최근의 올림픽이 올렸나요?,"In 2022, Beijing will become the first-ever ci...",ko,True,188,2004,
4795,세상에서 가장 오래된 방송사는 무엇인가?,The British Broadcasting Corporation (BBC) is ...,ko,True,4,British Broadcasting Corporation (BBC),
4796,팔레스타인 수도는 어딘가요?,"Palestine ( '), officially the State of Palest...",ko,True,205,Jerusalem,


In [6]:
df_train_te = df_train[df_train['lang'] == 'te'].copy()
df_train_te[:5]

Unnamed: 0,question,context,lang,answerable,answer_start,answer,answer_inlang
13771,ప్రపంచంలో మొట్టమొదటి దూర విద్య విద్యాలయం ఏ దే...,"Referred to as ""People's University"" by Charle...",te,True,236,London,
13772,1959వ సంవత్సరంలో భారతదేశ ప్రధాన మంత్రి ఎవరు?,"Since 1947, there have been 14 different prime...",te,True,220,Jawaharlal Nehru,
13773,ఏ కాకతీయ రాజు కర్నూలు జిల్లాను చివరిగా పాలించాడు?,"Rani Rudrama Devi (died 1289 or 1295), who def...",te,True,194,Prataparudra,
13774,మానవ హక్కులు ఎన్ని?,The Declaration consists of 30 articles affirm...,te,True,28,30,
13775,భారదేశంలో అత్యధిక జనాభా కలిగిన రాష్ట్రం ఏది?,"Uttar Pradesh (; IAST: ""Uttar Pradeś"" ) is a s...",te,True,0,Uttar Pradesh,


In [7]:
# validation set - splitting up the languages
df_val_ar = df_val[df_val['lang'] == 'ar'].copy()
df_val_ar[:5]

Unnamed: 0,question,context,lang,answerable,answer_start,answer,answer_inlang
1411,ما هي أولى جامعات فنلندا؟,"The Royal Academy of Åbo ( or ""Åbo Kungliga Ak...",ar,True,4,Royal Academy of Åbo,
1412,ما عدد الدول المطلة على بحر البلطيق؟,The Baltic Sea is a marginal sea of the Atlant...,ar,True,68,"Finland, Sweden, Denmark, Estonia, Latvia, Lit...",
1413,اين عاش نيوتن؟,"From age 12 to age 17, Newton resided with Wil...",ar,True,74,Grantham,
1417,هل زار ابن بطوطة اليمن؟,"After the ""hajj"" in either 1328 or 1330, he ma...",ar,False,-1,no,
1422,من هو الرئيس الأول للجمهورية اليمنية؟,The first President of unified Yemen was Ali A...,ar,True,41,Ali Abdullah Saleh,


In [8]:
df_val_ko = df_val[df_val['lang'] == 'ko'].copy()
df_val_ko[:5]

Unnamed: 0,question,context,lang,answerable,answer_start,answer,answer_inlang
356,북유럽의 노르딕 국가는 몇개인가요?,"At the beginning of the 20th century, almost 1...",ko,True,393,five,
357,1887년 케이스 웨스턴 리저브 대학의 이름은 무엇인가?,Case Western Reserve University was created in...,ko,True,58,Western Reserve University (formerly Western R...,
358,옴진리교는 어느 나라에서 시작된 종교인가?,These letters are believed to have derived fro...,ko,True,51,Egypt,
359,댈러스의 면적은 얼마나 되나요?,Dallas is the county seat of Dallas County. Po...,ko,True,232,999.3 km2,
360,오픈스택의 프로그래밍 언어는 무엇인가요?,It is written in Python and uses many external...,ko,True,17,Python,


In [9]:
df_val_te = df_val[df_val['lang'] == 'te'].copy()
df_val_te[:5]

Unnamed: 0,question,context,lang,answerable,answer_start,answer,answer_inlang
0,ఒరెగాన్ రాష్ట్రంలోని అతిపెద్ద నగరం ఏది ?,Portland is the largest city in the U.S. state...,te,True,0,Portland,
1,కలరా వ్యాధిని మొదటగా ఏ దేశంలో కనుగొన్నారు ?,"The word cholera is from ""kholera"" from χολή ""...",te,True,99,Indian subcontinent,
2,కలరా వ్యాధిని మొదటగా ఏ దేశంలో కనుగొన్నారు ?,Since it became widespread in the 19th century...,te,True,451,England,
3,మొదటి ప్రపంచ యుద్ధం ఎప్పుడు మొదలయింది ?,World War I occurred from 1914 to 1918. In ter...,te,True,26,1914,
4,మొదటి ప్రపంచ యుద్ధం ఎప్పుడు మొదలయింది ?,"World War I (often abbreviated as WWI or WW1),...",te,True,155,28 July 1914,


### 1.1.3 Find summary statistics

#### 1.1.3.1 Summary statistics: training data

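The function below:
1. tokenizes the `question` column with the NLLB tokenizer
2. counts the tokens per question (subword tokens, used here as a proxy for word count)
3. counts the number of documents (rows)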
In [64]:
def summary_stat(df, df_name, column):
    """Tokenize `column` in place and return the number of rows and the total token count."""
    df.loc[:, 'tokens'] = df[column].apply(tokenizer.tokenize)  # subword tokens from the NLLB tokenizer
    df.loc[:, 'word_count'] = df['tokens'].apply(len)  # token count per row (subword tokens, not whitespace words)

    total_words = int(df['word_count'].sum())  # total token count, cast to a plain int for printing
    num_doc = len(df)  # number of documents (rows)

    return {
        "df": df_name,
        "document length": num_doc,  # number of documents
        "total words": total_words  # total number of subword tokens
    }
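
Because `tokenizer.tokenize` returns subword pieces, `summary_stat` counts subword tokens rather than words. As a rough point of comparison, a whitespace-based count could be sketched as below. `whitespace_word_stats` is a hypothetical helper that is not part of the notebook, and splitting on whitespace is only an approximation of "words", particularly for agglutinative languages such as Korean and Telugu.

```python
def whitespace_word_stats(questions):
    """Count documents and whitespace-separated words in an iterable of strings."""
    # str.split() with no arguments splits on any run of whitespace
    counts = [len(q.split()) for q in questions]
    num_docs = len(counts)
    total_words = sum(counts)
    return {
        "documents": num_docs,
        "total words": total_words,
        "mean words per document": total_words / num_docs if num_docs else 0.0,
    }


# toy input; with the notebook's data this would be e.g. df_train_ko["question"]
example = ["엑스선은 누가 발견하였는가?", "팔레스타인 수도는 어딘가요?"]
stats = whitespace_word_stats(example)
print(stats)
```

Reporting both numbers side by side would show how strongly the tokenizer splits words in each language.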


In [58]:
# single quotes inside the f-strings keep this valid on Python versions before 3.12
print(
    f"summary stat training arabic: {summary_stat(df_train_ar, df_name='df_train_ar', column='question')}\n"
    f"summary stat training korean: {summary_stat(df_train_ko, df_name='df_train_ko', column='question')}\n"
    f"summary stat training telugu: {summary_stat(df_train_te, df_name='df_train_te', column='question')}"
)

summary stat training arabic: {'df': 'df_train_ar', 'document length': 2558, 'total words': 33733}
summary stat training korean: {'df': 'df_train_ko', 'document length': 2422, 'total words': 25829}
summary stat training telugu: {'df': 'df_train_te', 'document length': 1355, 'total words': 18365}


#### 1.1.3.2 Summary statistics: validation data

In [65]:
# df_name now matches the validation DataFrames (it previously said df_train_*)
print(
    f"summary stat validation arabic: {summary_stat(df_val_ar, df_name='df_val_ar', column='question')}\n"
    f"summary stat validation korean: {summary_stat(df_val_ko, df_name='df_val_ko', column='question')}\n"
    f"summary stat validation telugu: {summary_stat(df_val_te, df_name='df_val_te', column='question')}"
)

summary stat validation arabic: {'df': 'df_val_ar', 'document length': 415, 'total words': 5604}
summary stat validation korean: {'df': 'df_val_ko', 'document length': 356, 'total words': 3775}
summary stat validation telugu: {'df': 'df_val_te', 'document length': 384, 'total words': 5020}


## 1.2 Report

For each of the languages Arabic, Korean and Telugu:
- report the 5 most common words in the questions from the training set and their count, 
- as well as their English translation
- What kind of words are they?

In [82]:
# helper that is applied to all 3 languages
def common_words(df, df_name, column):
    """Return the 5 most frequent tokens in a column of token lists."""
    token_list = [token for row in df[column] for token in row]  # flatten the per-row token lists
    word_count = Counter(token_list)  # frequency of each token
    most_common_words = word_count.most_common(5)  # 5 most frequent (token, count) pairs

    return {
        "df": df_name,
        "words": most_common_words
    }


In [83]:
# single quotes inside the f-strings keep this valid on Python versions before 3.12
print(
    f"5 most common words arabic: {common_words(df_train_ar, df_name='df_train_ar', column='tokens')}\n"
    f"5 most common words korean: {common_words(df_train_ko, df_name='df_train_ko', column='tokens')}\n"
    f"5 most common words telugu: {common_words(df_train_te, df_name='df_train_te', column='tokens')}"
)

5 most common words arabic: {'df': 'df_train_ar', 'words': [('؟', 1483), ('▁؟', 1057), ('ية', 656), ('▁في', 609), ('▁من', 593)]}
5 most common words korean: {'df': 'df_train_ko', 'words': [('?', 2420), ('인가', 610), ('▁무엇인가', 592), ('은', 586), ('▁가장', 529)]}
5 most common words telugu: {'df': 'df_train_te', 'words': [('?', 1093), ('▁ఎవరు', 274), ('▁?', 260), ('▁ఏ', 223), ('▁ఏది', 192)]}


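English translations:
The most frequent items are subword tokens from the tokenizer rather than whole words (`▁` marks the start of a word). They are mostly punctuation, function words, particles and question words, i.e. stop-word-like items. Translating isolated tokens by machine is unreliable: Korean `은` is the topic particle here (it is also a homograph of the word for 'silver'), and `인가` is a question ending (as a standalone noun it can mean 'approval').

**arabic**
- `؟`: ? (question mark)
- `▁؟`: ? (question mark at the start of a new word, i.e. preceded by a space)
- `ية`: -iyya, a word-final suffix rather than a standalone word
- `▁في`: in
- `▁من`: from (the unvocalized spelling can also be read as 'who')

**korean**
- `?`: ?
- `인가`: question ending ('is it ...?')
- `▁무엇인가`: what is (it)?
- `은`: topic particle
- `▁가장`: most

**telugu**
- `?`: ?
- `▁ఎవరు`: who
- `▁?`: ? (preceded by a space)
- `▁ఏ`: which
- `▁ఏది`: which one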
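
Since punctuation takes up several of the top-5 slots, one option is to drop punctuation and the `▁` word-start marker before counting. The following is a minimal sketch under that assumption. `is_punctuation` and `common_content_tokens` are hypothetical helpers, not part of the notebook, and the token rows are toy input rather than the real data.

```python
import unicodedata
from collections import Counter


def is_punctuation(token):
    """True if every character in the token is Unicode punctuation."""
    return bool(token) and all(
        unicodedata.category(ch).startswith("P") for ch in token
    )


def common_content_tokens(token_rows, n=5):
    """Most frequent tokens after removing the word-start marker and punctuation."""
    counter = Counter()
    for row in token_rows:
        for token in row:
            cleaned = token.lstrip("▁")  # strip the word-start marker
            if cleaned and not is_punctuation(cleaned):
                counter[cleaned] += 1
    return counter.most_common(n)


# toy token rows; with the notebook's data this would be e.g. df_train_ko["tokens"]
rows = [["▁무엇", "인가", "?"], ["▁가장", "▁무엇", "인가", "?"], ["▁?"]]
top = common_content_tokens(rows, n=2)
print(top)
```

Stripping `▁` also merges word-initial and word-internal occurrences of the same piece, which may or may not be wanted for this report.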

## 1.3 Implement a rule-based classifier
- The classifier predicts whether a question is answerable or impossible, using only the document (context) and the question
- Use the `answerable` field to evaluate it on the validation set.
- What is the performance of your classifier for each of the languages Arabic, Korean and Telugu?
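
One possible shape for such a classifier and its per-language evaluation is sketched below. The rule itself is an assumption chosen for illustration, not the assignment's intended solution: it predicts "answerable" unless the question contains ASCII digits or Latin-script strings that never occur in the English context. `rule_based_predict`, `accuracy_by_language` and the toy rows are all hypothetical.

```python
import re
from collections import defaultdict


def rule_based_predict(question, context):
    """Toy rule: answerable unless no Latin/digit string from the question is in the context."""
    anchors = re.findall(r"[A-Za-z]+|[0-9]+", question)
    if not anchors:
        return True  # nothing to compare, fall back to the "answerable" default
    context_lower = context.lower()
    return any(anchor.lower() in context_lower for anchor in anchors)


def accuracy_by_language(rows):
    """rows: iterable of dicts with keys question, context, lang, answerable."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for row in rows:
        pred = rule_based_predict(row["question"], row["context"])
        correct[row["lang"]] += int(pred == bool(row["answerable"]))
        total[row["lang"]] += 1
    return {lang: correct[lang] / total[lang] for lang in total}


# invented toy rows, only to show the expected input format
toy_rows = [
    {"question": "BBC는 언제 설립되었나요?", "context": "The BBC was founded in 1922.",
     "lang": "ko", "answerable": True},
    {"question": "NASA는 언제 설립되었나요?", "context": "The BBC was founded in 1922.",
     "lang": "ko", "answerable": False},
    {"question": "మానవ హక్కులు ఎన్ని?", "context": "The Declaration consists of 30 articles.",
     "lang": "te", "answerable": True},
    {"question": "1959వ సంవత్సరంలో ప్రధాన మంత్రి ఎవరు?", "context": "Nehru served from 1947 to 1964.",
     "lang": "te", "answerable": True},
]
scores = accuracy_by_language(toy_rows)
print(scores)
```

With the notebook's data, the rows could come from something like `df_val[df_val["lang"].isin(["ar", "ko", "te"])].to_dict("records")`. The last toy row shows a weakness of the rule: a year in the question that falls inside a range in the context is wrongly treated as unanswerable.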

- need to define a classifier
- argument for a bag-of-words logistic/softmax regression (may need a source, this is from my notes)
  - works well on large datasets and long texts
  - outputs usable class probabilities
- regularization helps prevent overfitting, so an L2 (ridge) penalty can be added to the logistic regression loss
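
To make the last two notes concrete, here is a minimal dependency-free sketch of a bag-of-words logistic regression with an L2 penalty, trained by batch gradient descent. It is an illustration under assumptions, not the notebook's implementation: the function names, the hyperparameters (`l2`, `lr`, `epochs`) and the toy documents and labels are all made up for the example.

```python
import math
from collections import Counter


def bag_of_words(tokens, vocab):
    """Map a token list to a count vector over a fixed vocabulary."""
    counts = Counter(tokens)
    return [float(counts[word]) for word in vocab]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(x, w, b):
    """Probability of the positive class for one feature vector."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)


def train_logreg_l2(X, y, l2=0.01, lr=0.5, epochs=200):
    """Batch gradient descent on mean log loss + (l2 / 2) * ||w||^2."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = predict_proba(xi, w, b) - yi  # gradient of the log loss w.r.t. the logit
            for j in range(d):
                grad_w[j] += err * xi[j] / n
            grad_b += err / n
        # the L2 (ridge) penalty adds l2 * w to the gradient; the bias is not penalized
        w = [wj - lr * (gj + l2 * wj) for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


# invented toy data: 1 = answerable, 0 = not answerable
docs = [["when", "founded"], ["who", "founded"], ["is", "it", "true"], ["is", "it", "legal"]]
labels = [1, 1, 0, 0]
vocab = sorted({token for doc in docs for token in doc})
X = [bag_of_words(doc, vocab) for doc in docs]
w, b = train_logreg_l2(X, labels)
probs = [predict_proba(x, w, b) for x in X]
print([round(p, 2) for p in probs])
```

In PyTorch, which the notebook already imports, the same penalty is commonly applied through the `weight_decay` argument of an optimizer such as `torch.optim.SGD`.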