# <div align="center">CM1</div>


###  1.1 Required Libraries

In [5]:
import pandas as pd
import numpy as np
import re
from nltk.stem.porter import PorterStemmer
from nltk.corpus import stopwords
import nltk
# nltk.download('punkt')
from nltk.tokenize import sent_tokenize, word_tokenize
import gensim.downloader as api
from datasets import load_dataset
from scipy import spatial
import math
from gensim import models
from gensim.models import Word2Vec
from sklearn.model_selection import train_test_split

### 1.2 Load dataset 

- Many columns contain a large number of null values, and we verified that very little data would remain if the rows containing null values were removed.

- Moreover, we focus on the text fields of the dataset to build the model. Hence there is no good reason to normalise the data or remove null values.


**Dataset Overview :**

- The Climate-FEVER dataset adopts the FEVER methodology and consists of 1,535 real-world claims regarding climate change collected on the internet.

- Each claim is accompanied by five manually annotated evidence sentences (evidence 0 to 4) retrieved from Wikipedia that support, refute or do not give enough information to validate the claim.

**Dataset Feature :**

1) claim_id : A unique claim identifier in the dataset.
2) claim :  claim text.
3) claim_label :   Overall label assigned to the claim (based on the evidence majority vote). The labels correspond to 0: "refutes", 1: "supports" and 2: "not enough info".
4) evidences : A list of evidences with the fields below : 
    1. evidence_id : A unique evidence identifier.
    2. evidence_label : A micro-verdict label. The labels correspond to 0: "refutes", 1: "supports" and 2: "not enough info".
    3. article : The title of the source article (Wikipedia page).
    4. evidence : An evidence sentence.
    5. entropy : An entropy reflecting the uncertainty of evidence_label.
    6. votes : Refers to individual votes.


**Note :** Each claim has a total of 5 evidences available in the dataset (evidence 0 to 4).

In [6]:
dataset = load_dataset('climate_fever')
dataset['test']

Using custom data configuration default
Reusing dataset climate_fever (C:\Users\DELL\.cache\huggingface\datasets\climate_fever\default\1.0.1\3b846b20d7a37bc0019b0f0dcbde5bf2d0f94f6874f7e4c398c579f332c4262c)


Dataset({
    features: ['claim_id', 'claim', 'claim_label', 'evidences'],
    num_rows: 1535
})

### 1.3 Assigned claim and evidence feature values to corpus ( In order to apply word embedding )

In [7]:
sent_list = list()
for row in dataset["test"]:
    # Each claim is followed by the article title and sentence of each of its evidences.
    sent_list.append(row["claim"])

    for data in row["evidences"]:
        sent_list.append(data["article"])
        sent_list.append(data["evidence"])

### 1.4 Text/Data Preprocessing :

- NLTK :
    - We used the stopwords list from the nltk.corpus package to remove stopwords from the corpus.
- PorterStemmer :
    - Stemming is the process of reducing the morphological variants of a word to a common root/base form.
    - Stemming programs are commonly referred to as stemming algorithms or stemmers. Ideally, a stemming algorithm reduces the words “chocolates”, “chocolatey”, “choco” to the root word, “chocolate”.
    
**Errors in Stemming :**
There are mainly two errors in stemming – **Overstemming** and **Understemming**. 
- Overstemming occurs when two words with different stems are reduced to the same root. 
- Under-stemming occurs when two words that share the same stem are reduced to different roots.
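
The two error types can be illustrated with a deliberately crude stemmer. The sketch below is not the Porter algorithm; `toy_stem` and its suffix list are hypothetical and exist only to make the two failure modes visible.

```python
# Toy suffix-stripping stemmer, for illustration only (this is NOT the Porter algorithm).
SUFFIXES = ("ity", "al", "es", "s", "e")


def toy_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


# Overstemming: three words with different meanings collapse to one root.
over = {w: toy_stem(w) for w in ("universe", "university", "universal")}

# Understemming: two forms of the same word keep different roots.
under = {w: toy_stem(w) for w in ("data", "datum")}

print(over)
print(under)
```

Real stemmers use far more careful rules, but they trade off the same two kinds of error.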


In [8]:

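claim_corpus = []

# Build the stemmer and the stopword set once instead of on every iteration.
ps = PorterStemmer()
all_stopwords = stopwords.words('english')
all_stopwords.remove('not')  # keep the negation word 'not'
all_stopwords = set(all_stopwords)

# Note: this range covers only the first 1,534 entries of sent_list.
for i in range(0, 1534):
    # Keep letters only (A-Z, not A-z, which would also match [ \ ] ^ _ and `).
    sentences = re.sub('[^a-zA-Z]', ' ', sent_list[i])
    sentences = sentences.lower()
    sentences = sentences.split()
    sentences = [ps.stem(word) for word in sentences if word not in all_stopwords]
    sentences = ' '.join(sentences)
    claim_corpus.append(sentences)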
    

### 1.5 Tokenized every sentence of the corpus in order to embed the text dataset with Word2Vec

In [9]:
corpusVec = [nltk.word_tokenize(sentences) for sentences in claim_corpus]

### 1.6 Build a word2Vec model

- The Word2Vec model maps words to real-valued vectors while capturing something about the meaning of the text. If two words have similar meanings, their vectors lie close to each other in the dense vector space.
- Word2Vec offers two training architectures: Skip-Gram and continuous bag of words (CBOW).

**Parameters :**
- min_count (int) – Ignores all words with a total frequency lower than the min_count value.
- size (int) – Dimensionality of the feature vectors (this parameter is named vector_size in gensim 4.x).
- window (int) – The maximum distance between the current and predicted word within a sentence.
- seed (int) – Seed for the random number generator. Initial vectors for each word are seeded with a hash of the concatenation of word + str(seed).
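
To make the `window` parameter concrete, here is a minimal sketch of how (current word, context word) pairs can be enumerated for a Skip-Gram style model. The helper `skipgram_pairs` and the example tokens are hypothetical, and this is a simplified picture rather than gensim's actual training loop.

```python
def skipgram_pairs(tokens, window):
    """Return (current, context) pairs for words at most `window` positions apart."""
    pairs = []
    for i, current in enumerate(tokens):
        start = max(0, i - window)
        stop = min(len(tokens), i + window + 1)
        for j in range(start, stop):
            if j != i:  # a word is not its own context
                pairs.append((current, tokens[j]))
    return pairs


# Hypothetical stemmed sentence, in the style of the corpus above.
example_tokens = ["sea", "level", "rise"]
example_pairs = skipgram_pairs(example_tokens, window=1)
print(example_pairs)
```

A larger `window` yields more pairs per sentence and lets more distant words influence each other's vectors.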

**Note :**
- We built the Word2Vec model on the entire corpus and only afterwards split the tokenized sentences into train and test sets.

In [10]:
model = Word2Vec(corpusVec,  min_count =1)
model.save("word2vec.bin")

### 1.7 Split word embeddings into train and test sets

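- Train set (corpusVec_train) contains 80% of the tokenized sentences; their word embeddings are looked up in 1.7.1.
- Test set (corpusVec_test) contains 20% of the tokenized sentences; their word embeddings are looked up in 1.7.1.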

In [11]:
corpusVec_train, corpusVec_test = train_test_split(corpusVec,test_size=0.20,random_state = None)

#### 1.7.1 Created dictionaries to store each word and its embedding

- 2 dictionaries are created ( 1 for the train set and another for the test set ).
- As the vector size ( size ) of the Word2Vec model is 100 (default value), each word has a vector of 100 numbers attached to it.

In [12]:
embedding_train = dict()
embedding_test = dict()


for sentence in corpusVec_train:
    for word in sentence:
        embedding_train[word] = model.wv[word]
        
for sentence in corpusVec_test:
    for word in sentence:
        embedding_test[word] = model.wv[word]

In [13]:
len(embedding_train.keys())

2194

In [14]:
len(embedding_test)

1006
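
Because the split is made over sentences rather than over words, the same word can occur in both dictionaries. This is why the two sizes above add up to 3,200, more than the 2,397 words in the model vocabulary reported by print(model). A minimal sketch with hypothetical toy sentences (not taken from the dataset) shows the effect:

```python
# Hypothetical tokenized sentences standing in for corpusVec_train / corpusVec_test.
train_sentences = [["sea", "level", "rise"], ["ice", "melt"]]
test_sentences = [["sea", "ice", "extent"]]

# Collect the distinct words seen on each side of the split.
train_vocab = {word for sentence in train_sentences for word in sentence}
test_vocab = {word for sentence in test_sentences for word in sentence}

# Words that appear in both splits end up in both embedding dictionaries.
shared_words = train_vocab & test_vocab
print(len(train_vocab), len(test_vocab), sorted(shared_words))
```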

#### 1.7.2 Created train and test dataframe from embeddings

In [15]:
train_emb_df = pd.DataFrame.from_dict(embedding_train)
test_emb_df = pd.DataFrame.from_dict(embedding_test)

In [16]:
train_emb_df.head()

Unnamed: 0,present,primari,sourc,co,emiss,burn,coal,natur,ga,petroleum,...,judaism,demograph,done,halt,bar,mpa,dedic,polit,landfil,bioga
0,-0.001782,0.005648,0.003972,0.009453,0.015537,0.000564,0.004776,0.009171,0.006098,0.003724,...,-0.000942,-0.003625,0.002108,0.004,-0.001295,-0.000734,0.004902,-0.002161,0.002322,0.004523
1,-0.010028,-0.001798,-0.004617,-0.034945,-0.035831,-0.011301,-0.006917,-0.037424,-0.021329,-0.002396,...,-0.002988,-0.001905,-0.005544,0.002553,0.000243,0.001047,-0.004576,0.000349,0.002963,-0.00263
2,0.011174,0.005265,0.006674,0.028128,0.034854,0.006606,0.00492,0.027529,0.01902,0.001535,...,0.004009,-0.003461,0.000587,0.005062,0.00089,0.003953,0.002531,0.000641,-0.002214,-0.001386
3,-0.009948,-0.008878,-0.009909,-0.045867,-0.047601,-0.007597,-0.008281,-0.03495,-0.031299,-0.011499,...,-0.004152,-0.003903,0.000796,-0.004872,-0.003172,0.002968,-0.00621,-0.002827,0.001878,0.0025
4,-0.008551,-0.006092,-0.015271,-0.046082,-0.044224,-0.009806,-0.00278,-0.040721,-0.028435,-0.011022,...,0.000291,-0.005975,0.002808,-0.004393,-0.004838,0.00404,0.003215,0.001921,0.001409,-0.005316


In [17]:
print(model)

Word2Vec(vocab=2397, size=100, alpha=0.025)


#### 1.7.3 Features of Word2Vec :

- After the model is trained, the word vectors are accessible via the “wv” attribute. This is the actual word vector model on which queries can be made.
- We can also save and load our Word2Vec model using "model.save("word2vec.bin")" and "model = Word2Vec.load("word2vec.bin")" respectively.
- We can retrieve the learned vocabulary of tokens (words) as follows:

In [18]:
words = model.wv.vocab
# print(words)


### 1.8 Cosine Similarity Function ( In order to calculate text similarity ): 

- Among the different distance metrics, cosine similarity is intuitive and the most commonly used with word2vec. It is the normalized dot product of 2 vectors, and this ratio is the cosine of the angle between them.
- Cosine similarity between two vectors (A and B) can be calculated using dot(A, B)/(norm(A)*norm(B)). Here norm(A) and norm(B) are the Euclidean norms of the respective vectors.

**Note :** In order to build a function for cosine similarity we used scipy.spatial.distance.cosine, which returns the cosine distance, i.e. 1 minus the cosine similarity.
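
The formula dot(A, B)/(norm(A)*norm(B)) can also be written out directly. The sketch below uses only the standard library; the function name `cosine_similarity` and the example vectors are hypothetical and are not part of the notebook's own code.

```python
import math


def cosine_similarity(a, b):
    """Plain cosine similarity: dot(a, b) / (norm(a) * norm(b))."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])   # parallel vectors
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])       # 90 degrees apart
opposite = cosine_similarity([1.0, 0.0], [-1.0, 0.0])        # pointing opposite ways
print(same_direction, orthogonal, opposite)
```

Without an absolute value, the plain measure is negative for vectors that point in opposite directions.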

In [19]:
def cos_similarity(v1, v2):
    return abs(1 - spatial.distance.cosine(v1,v2))

#### 1.8.1 Cosine Similarity Function Examples

In [20]:
ex1 = cos_similarity(embedding_train["slower"], embedding_train["global"])
ex2 = cos_similarity(embedding_train["high"], embedding_train["low"])
ex3 = cos_similarity(embedding_train["peopl"], embedding_train["human"])
ex4 = cos_similarity(embedding_train["australia"], embedding_train["warm"])
ex5 = cos_similarity(embedding_train["scientists"], embedding_train["guidelin"])
print(f"Cosine Similarity between 'slower'and 'global' is : {ex1}")
print(f"Cosine Similarity between 'high'and 'low' is : {ex2}")
print(f"Cosine Similarity between 'peopl'and 'human' is : {ex3}")
print(f"Cosine Similarity between 'australia'and 'warm' is : {ex4}")
print(f"Cosine Similarity between 'scientists'and 'guidelin' is : {ex5}")

Cosine Similarity between 'slower'and 'global' is : 0.6261319518089294
Cosine Similarity between 'high'and 'low' is : 0.9467276334762573
Cosine Similarity between 'peopl'and 'human' is : 0.8716719150543213
Cosine Similarity between 'australia'and 'warm' is : 0.9703410267829895
Cosine Similarity between 'scientists'and 'guidelin' is : 0.09903361648321152


#### 1.8.2 Analysis of Cosine Similarity 

- Cosine similarity is measured by the cosine of **the angle between two vectors** and determines whether two vectors are pointing in roughly the same direction or not.
- Cosine similarity itself ranges from -1 to 1. Because cos_similarity above takes the absolute value, its score ranges from 0 to 1, with 0 being the lowest (the least similar) and 1 being the highest (the most similar).
- A cosine value of 0 means that the two vectors are at 90 degrees to each other (orthogonal) and have no match. 
- The closer the cosine value is to 1, the smaller the angle and the greater the match between the vectors.
- From the above examples and the embedded vector of each token ( model.wv[word] ), it is clear that the **more the embedded vectors of two tokens differ** in direction, the lower the cosine similarity ( for example scientists and guidelin ).

### 1.9 Arithmetic computations on embedding vectors

- After successfully building a Word2Vec model, we can do a little linear algebra arithmetic with words.
- Gensim provides an interface for performing these types of operations through the most_similar() function of the trained or loaded model.
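
As a rough illustration of what such word arithmetic involves, the sketch below adds and subtracts unit-normalized vectors and ranks the remaining words by cosine similarity. The helpers `unit` and `analogy` and the 2-dimensional toy vectors are hypothetical; gensim's own implementation differs in detail.

```python
import math


def unit(v):
    """Scale a vector to unit (Euclidean) length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def analogy(vectors, positive, negative, topn=3):
    """Rank words by cosine similarity to sum(positive) - sum(negative)."""
    dim = len(next(iter(vectors.values())))
    query = [0.0] * dim
    for word in positive:
        query = [q + x for q, x in zip(query, unit(vectors[word]))]
    for word in negative:
        query = [q - x for q, x in zip(query, unit(vectors[word]))]
    query = unit(query)
    exclude = set(positive) | set(negative)  # never return the query words themselves
    scores = [
        (word, sum(q * x for q, x in zip(query, unit(vec))))
        for word, vec in vectors.items()
        if word not in exclude
    ]
    return sorted(scores, key=lambda item: item[1], reverse=True)[:topn]


# Hypothetical 2-d toy vectors, NOT taken from the trained model.
toy_vectors = {
    "king": [0.9, 0.8],
    "queen": [0.9, 0.2],
    "man": [0.1, 0.8],
    "woman": [0.1, 0.2],
    "apple": [-0.5, -0.5],
}
best = analogy(toy_vectors, positive=["king", "woman"], negative=["man"], topn=1)
print(best)
```

With these toy vectors the top-ranked word for king - man + woman is "queen"; on a small corpus such clean structure is much less likely to emerge.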


Example 1 : **ocean - water + arctic**

In [21]:
model.wv.most_similar(positive=['ocean', 'arctic'], negative=['water'])

[('may', 0.7864083051681519),
 ('region', 0.7860227823257446),
 ('potenti', 0.7854949235916138),
 ('end', 0.7844065427780151),
 ('energi', 0.784231424331665),
 ('averag', 0.7840951085090637),
 ('core', 0.7837210893630981),
 ('releas', 0.7836814522743225),
 ('recent', 0.7832280397415161),
 ('rate', 0.7821105122566223)]

Example 2 :  **extrem - weather + temperatur** 

In [54]:
model.wv.most_similar(positive=['extrem', 'temperatur'], negative=['weather'])

[('mean', 0.9193539619445801),
 ('rate', 0.9187940359115601),
 ('human', 0.9180346131324768),
 ('ice', 0.9178087115287781),
 ('c', 0.9177825450897217),
 ('ga', 0.9171391725540161),
 ('year', 0.9167925715446472),
 ('increas', 0.9164981842041016),
 ('world', 0.9162797927856445),
 ('climat', 0.9155672788619995)]

Example 3 : **'heat' - 'rise' + 'cold' + 'warm'**

In [23]:
model.wv.most_similar(positive=['heat', 'cold', 'warm'], negative=['rise'])
#model.wv.most_similar('heat')

[('extrem', 0.9505780935287476),
 ('would', 0.9499229788780212),
 ('c', 0.9487543106079102),
 ('sinc', 0.9486084580421448),
 ('surfac', 0.9485632181167603),
 ('caus', 0.9478595852851868),
 ('recent', 0.9478594064712524),
 ('f', 0.9477652311325073),
 ('report', 0.9471041560173035),
 ('natur', 0.9470906257629395)]

Example 4 : **'high' - 'temperatur' + 'heat'**

In [24]:
model.wv.most_similar(positive=['high', 'heat'], negative=['temperatur'])

[('emiss', 0.9673811197280884),
 ('like', 0.9671779870986938),
 ('co', 0.966587483882904),
 ('climat', 0.9659023284912109),
 ('warm', 0.9659013152122498),
 ('use', 0.9655390977859497),
 ('weather', 0.9654750823974609),
 ('averag', 0.9652884602546692),
 ('increas', 0.9649722576141357),
 ('result', 0.9646478891372681)]

Example 5 : **'climat' + 'chang' + 'temperatur'**

In [25]:
model.wv.most_similar(positive=['climat', 'chang', 'temperatur'])

[('increas', 0.9970048666000366),
 ('caus', 0.9969384074211121),
 ('co', 0.9964942932128906),
 ('human', 0.9958808422088623),
 ('year', 0.9957847595214844),
 ('c', 0.9956972599029541),
 ('carbon', 0.995648980140686),
 ('use', 0.9955489635467529),
 ('global', 0.9954564571380615),
 ('emiss', 0.995436429977417)]

Example 6 : **'hurrican' - 'storm' + 'wind'**

In [26]:
model.wv.most_similar(positive=['hurrican', 'wind'], negative=['storm'])

[('world', 0.8534212112426758),
 ('carbon', 0.8465461134910583),
 ('reduc', 0.8461534976959229),
 ('industri', 0.8460868000984192),
 ('scientist', 0.8454304933547974),
 ('model', 0.8453484177589417),
 ('oxygen', 0.8450800776481628),
 ('year', 0.8444372415542603),
 ('earth', 0.843956708908081),
 ('natur', 0.8435795307159424)]

#### 1.9.1 Analysis of arithmetic computations on the embedding vectors :

- From the above 6 examples, it is clear that our model is not able to identify the words we are looking for, even though the similarity scores are very high.
- In some arithmetic computations our model is able to identify a correct word (up to some extent), such as in example 5 (climat + chang + temperatur = increas with similarity score 0.997), but it is not able to identify the word 'decrease'.


**Most similar output for above example from our model :**

1. ocean - water + arctic = region
2. extrem - weather + temperatur = increas ( 1st most similar word is not related )
3. heat - rise + cold + warm = extrem ( Expected is decrease )
4. high - temperatur + heat = warm
5. climat + chang + temperatur = increas ( Decrease is not identified by the model )
6. hurrican - storm + wind = world ( Ideally the model should return increase )

**Note :** I did not set the topn parameter, so most_similar returns its default of 10 most similar words.

### 1.10 Examples to find similarity between 2 words: 

In [47]:

print(f"Similarity between words 'high' and 'low' is : {model.wv.similarity('high', 'low')}")
print(f"Similarity between words 'peopl' and 'human' is : {model.wv.similarity('peopl', 'human')}")
print(f"Similarity between words 'extrem' and 'weather' is : {model.wv.similarity('extrem', 'weather')}")
print(f"Similarity between words 'australia' and 'warm' is : {model.wv.similarity('australia', 'warm')}")
print(f"Similarity between words 'heat' and 'water' is : {model.wv.similarity('heat', 'water')}")

Similarity between words 'high' and 'low' is : 0.9467276334762573
Similarity between words 'peopl' and 'human' is : 0.8716719746589661
Similarity between words 'extrem' and 'weather' is : 0.9901080131530762
Similarity between words 'australia' and 'warm' is : 0.9703410267829895
Similarity between words 'heat' and 'water' is : 0.992263913154602


### 1.11 Loaded pretrained models and performed the same computations

We have loaded the pretrained models below :

1. glove-wiki-gigaword-50 model
2. GoogleNews-vectors-negative300 model
3. glove-twitter-25 model

#### 1.11.1  glove-wiki-gigaword-50 model :

In [31]:
pr_model1 = api.load("glove-wiki-gigaword-50")

In [32]:
pr_model1.most_similar('heat')

[('temperature', 0.7556608319282532),
 ('cold', 0.7549160718917847),
 ('temperatures', 0.7502944469451904),
 ('humidity', 0.7431684732437134),
 ('hot', 0.7388110160827637),
 ('moisture', 0.7275059819221497),
 ('chill', 0.7268701195716858),
 ('water', 0.7164075970649719),
 ('heating', 0.7090611457824707),
 ('add', 0.7034947872161865)]

**Example 1 in Pretrained model :**

In [56]:
pr_model1.most_similar(positive=['ocean', 'arctic'], negative=['water'], topn=3)

[('antarctica', 0.832500159740448),
 ('antarctic', 0.7697972655296326),
 ('polar', 0.7220200896263123)]

**Example 2 in Pretrained model :**

In [64]:
pr_model1.most_similar(positive=['extreme', 'temperature'], negative=['weather'], topn=3)

[('absorption', 0.7154293060302734),
 ('polarization', 0.702642560005188),
 ('weight', 0.7001583576202393)]

**Example 3 in Pretrained model :**

In [58]:
pr_model1.most_similar(positive=['heat', 'cold', 'warm'], negative=['rise'], topn=3)

[('cool', 0.7649688720703125),
 ('chill', 0.7594031095504761),
 ('hot', 0.7589381337165833)]

**Example 4 in Pretrained model :**

In [66]:
pr_model1.most_similar(positive=['high', 'heat'], negative=['temperature'], topn=3)

[('home', 0.7420942187309265),
 ('over', 0.7382623553276062),
 ('while', 0.7318180799484253)]

**Example 5 in Pretrained model :**

In [59]:
pr_model1.most_similar(positive=['climate', 'change', 'temperature'], topn=3)

[('changes', 0.8372025489807129),
 ('conditions', 0.8201982975006104),
 ('impact', 0.8165292739868164)]

**Example 6 in Pretrained model :**

In [60]:
pr_model1.most_similar(positive=['hurricane', 'wind'], negative=['storm'], topn=3)

[('winds', 0.7835291624069214),
 ('gusts', 0.7169768810272217),
 ('sound', 0.6997408866882324)]

#### 1.11.2  GoogleNews-vectors-negative300 model :

In [None]:
from gensim.models import Word2Vec
import gensim

pr_model2 = gensim.models.KeyedVectors.load_word2vec_format('GoogleNewsvectorsnegative300.bin', binary=True)

**Examples on the pretrained models (the first two cells query pr_model2; the remaining cells query pr_model3, the glove-twitter-25 model loaded in 1.11.3) :**

In [71]:
pr_model2.most_similar(positive=['ocean', 'arctic'], negative=['water'], topn=3)

[('zeppelin', 0.8628155589103699),
 ('dragons', 0.8590071797370911),
 ('castle', 0.8434920310974121)]

In [74]:
pr_model2.most_similar(positive=['extreme', 'temperature'], negative=['weather'], topn=3)

[('capacity', 0.8226763606071472),
 ('variable', 0.8071826696395874),
 ('velocity', 0.8006629943847656)]

In [75]:
pr_model3.most_similar(positive=['heat', 'cold', 'warm'], negative=['rise'], topn=3)

[('wet', 0.8819375038146973),
 ('freezing', 0.844189465045929),
 ('dry', 0.8343328237533569)]

In [76]:
pr_model3.most_similar(positive=['high', 'heat'], negative=['temperature'], topn=3)

[('beat', 0.9010361433029175),
 ('play', 0.8993435502052307),
 ('boys', 0.8965255618095398)]

#### 1.11.3 Glove-twitter-25 model :

In [68]:

pr_model3 = api.load("glove-twitter-25") 

In [69]:
pr_model3.most_similar('heat')

[('thunder', 0.9267944097518921),
 ('bulls', 0.911600649356842),
 ('ball', 0.9085021615028381),
 ('beat', 0.9076730012893677),
 ('playoffs', 0.8956629037857056),
 ('lakers', 0.8936164379119873),
 ('basketball', 0.8829619884490967),
 ('cowboys', 0.8791283965110779),
 ('baseball', 0.8777887225151062),
 ('celtics', 0.8775449991226196)]

### 1.12 Analysis of pre-trained models :

From the above outputs, and compared with our model's output, we can draw a few conclusions :

- Similarity scores in our model are very high compared to the pretrained models, but our model does not return accurate results.
- In the pretrained models the similarity scores are lower than in our model because the pretrained models are built upon much larger datasets.
- Using pretrained models we can also compare how effective our model is.

**Note:** As the pretrained models are trained on large datasets, we can't predict the expected values from most_similar.

### 1.13 Difference between most_similar and similar_by_vector :

- The most_similar function retrieves the vectors corresponding to "king", "woman" and "man", and normalizes them before computing king - man + woman.
- The function call model.similar_by_vector(v) just calls model.most_similar(positive=[v]), so a vector v computed by hand from raw (unnormalized) vectors can give different results.
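
Why the normalization step matters can be seen with two hypothetical vectors of very different lengths. The sketch below uses made-up 2-dimensional vectors and plain Python helpers (`unit`, `add` and `cosine` are illustrative names, not gensim functions).

```python
import math


def unit(v):
    """Scale a vector to unit (Euclidean) length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def add(u, v):
    """Element-wise sum of two vectors."""
    return [x + y for x, y in zip(u, v)]


def cosine(u, v):
    """Cosine similarity, computed as the dot product of the unit vectors."""
    return sum(x * y for x, y in zip(unit(u), unit(v)))


# Hypothetical vectors: one long, one short, pointing in different directions.
long_vec = [10.0, 0.0]
short_vec = [0.0, 1.0]

raw_query = add(long_vec, short_vec)                     # hand-made sum of raw vectors
normalized_query = add(unit(long_vec), unit(short_vec))  # sum after normalizing each input

print(cosine(raw_query, long_vec), cosine(raw_query, short_vec))
print(cosine(normalized_query, long_vec), cosine(normalized_query, short_vec))
```

The raw sum points almost entirely along the longer vector, while the normalized sum weighs both inputs equally, so the two queries can rank neighbours differently.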