# **ASSESSMENT FOR DATA SCIENCE**

## **PROBLEM STATEMENT**

**Dataset (attached with the task): The data contains pairs of text paragraphs, randomly sampled from a raw dataset. The two paragraphs in each pair may or may not be semantically similar. The candidate is to predict a value between 0 and 1 indicating the similarity between the two paragraphs of each pair. A sample of a similar dataset will be used as test data, so it is crucial to build the model solution using the provided dataset.**

## **Dataset Overview and Task Description:**

**Build an algorithm/model that can quantify the degree of similarity between the two texts based on semantic similarity. Semantic Textual Similarity (STS) assesses the degree to which two sentences are semantically equivalent to each other.**

**1 means highly similar.**

**0 means highly dissimilar.**

## **Practice Adopted**

**This problem falls under Natural Language Processing (NLP), where text embedding plays a central role. Text embedding converts sentences into numerical vectors, which makes it possible to assess the similarity between them with metrics such as Euclidean distance or cosine similarity.**

**In our case, cosine similarity is the chosen method for gauging sentence similarity. However, a challenge arises when attempting to convert keywords into vectors, as it necessitates consideration of context and meaning beyond just the keywords themselves.**
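**To illustrate the keyword problem, here is a minimal sketch that uses only the Python standard library. The helper `bow_cosine` and the two example sentences are hypothetical and are not part of the dataset or the solution. It compares two paraphrases using plain word counts.**

```python
import math
from collections import Counter

def bow_cosine(text_a: str, text_b: str) -> float:
    """Cosine similarity between simple word-count (bag-of-words) vectors."""
    counts_a = Counter(text_a.lower().split())
    counts_b = Counter(text_b.lower().split())
    # Only words present in both texts contribute to the dot product.
    dot = sum(counts_a[w] * counts_b[w] for w in counts_a.keys() & counts_b.keys())
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

# Hypothetical example sentences (not taken from the dataset).
paraphrase_a = "the film earned rave reviews"
paraphrase_b = "critics praised this movie"

print(bow_cosine(paraphrase_a, paraphrase_b))  # no shared words, so the score is 0.0
print(bow_cosine(paraphrase_a, paraphrase_a))  # identical text, so the score is close to 1
```

**The two sentences say roughly the same thing but share no words, so a keyword-based vector treats them as unrelated. This is the gap that context-aware sentence embeddings are meant to close.**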

**To address this, we employ the Universal Sentence Encoder (USE), which encodes text into high-dimensional vectors suited to semantic similarity tasks. The pre-trained Universal Sentence Encoder is readily available on TensorFlow Hub.**

In [1]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

/kaggle/input/precily-text/Precily_Text_Similarity.csv


## **Libraries**

**We import the necessary libraries and load the TF Hub module for the Universal Sentence Encoder.**

In [2]:
import tensorflow as tf
import pandas as pd
import tensorflow_hub as hub

# Load Universal Sentence Encoder
module_url = "https://tfhub.dev/google/universal-sentence-encoder/4"
model = hub.load(module_url)

def embed(texts):
    # Return one embedding vector per input string
    return model(texts)
     

## **Reading Data**

In [3]:
df = pd.read_csv('/kaggle/input/precily-text/Precily_Text_Similarity.csv', encoding = 'unicode_escape')

In [4]:
df.head()

| | text1 | text2 |
|---|---|---|
| 0 | broadband challenges tv viewing the number of ... | gardener wins double in glasgow britain s jaso... |
| 1 | rap boss arrested over drug find rap mogul mar... | amnesty chief laments war failure the lack of ... |
| 2 | player burn-out worries robinson england coach... | hanks greeted at wintry premiere hollywood sta... |
| 3 | hearts of oak 3-2 cotonsport hearts of oak set... | redford s vision of sundance despite sporting ... |
| 4 | sir paul rocks super bowl crowds sir paul mcca... | mauresmo opens with victory in la amelie maure... |


In [5]:
df.shape 

(3000, 2)

In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3000 entries, 0 to 2999
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   text1   3000 non-null   object
 1   text2   3000 non-null   object
dtypes: object(2)
memory usage: 47.0+ KB


In [7]:
df['text1'][0]

'broadband challenges tv viewing the number of europeans with broadband has exploded over the past 12 months  with the web eating into tv viewing habits  research suggests.  just over 54 million people are hooked up to the net via broadband  up from 34 million a year ago  according to market analysts nielsen/netratings. the total number of people online in europe has broken the 100 million mark. the popularity of the net has meant that many are turning away from tv  say analysts jupiter research. it found that a quarter of web users said they spent less time watching tv in favour of the net  the report by nielsen/netratings found that the number of people with fast internet access had risen by 60% over the past year.  the biggest jump was in italy  where it rose by 120%. britain was close behind  with broadband users almost doubling in a year. the growth has been fuelled by lower prices and a wider choice of always-on  fast-net subscription plans.  twelve months ago high speed internet

In [8]:
type(df['text1'][0])

str

## **Encoding text to vectors:**

**Our texts are sequences of words, and a pre-trained sentence-embedding model turns each text into a "dense numeric vector". In essence, by providing a pair of texts, we obtain a corresponding pair of vectors. Version 4 of the Universal Sentence Encoder (USE), trained on a variety of web text sources including Wikipedia, is loaded above. Note, however, that the cells below encode the texts with the Sentence-Transformers model `paraphrase-MiniLM-L6-v2`.**

In [9]:
pip install sentence_transformers

Collecting sentence_transformers
  Downloading sentence-transformers-2.2.2.tar.gz (85 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: sentence_transformers
  Building wheel for sentence_transformers (setup.py) ... done
  Created wheel for sentence_transformers: filename=sentence_transformers-2.2.2-py3-none-any.whl size=125923 sha256=c32e1d777648fbf3a10bb5f0f53a8bc32f75c39e069a9040da5ff1b8a2db754b
  Stored in directory: /root/.cache/pip/wheels/62/f2/10/1e606fd5f02395388f74e7462910fe851042f97238cbbd902f
Successfully built sentence_transformers
Installing collected packages: sentence_transformers
Successfully installed sentence_transformers-2.2.2
Note: you may need to restart the kernel to use updated packages.


In [10]:
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('paraphrase-MiniLM-L6-v2')
message = [df['text1'][0], df['text2'][0]]
message_embeddings = model.encode(message)
print(message_embeddings)


[[ 3.14776778e-01 -1.08762495e-01  8.57530255e-03 -3.49456459e-01
  -1.25621662e-01 -1.65014163e-01 -7.95810372e-02  1.87892467e-01
   5.24866246e-02  1.22429214e-01 -2.06158191e-01  1.68513343e-01
   2.04076752e-01 -2.09093928e-01  2.80888528e-02 -1.35204911e-01
   6.61901897e-03 -8.05484951e-01  1.49082437e-01  1.61286652e-01
  -3.62972379e-01 -4.64010954e-01 -2.05634743e-01 -1.67882703e-02
   2.94802308e-01 -1.92963585e-01 -5.79013228e-02 -1.50705799e-01
  -1.12011343e-01  2.53247827e-01 -1.79116115e-01  1.91414565e-01
   2.12627605e-01 -2.17836007e-01  1.52685687e-01 -5.72447777e-01
  -1.44791961e-01 -2.70303875e-01 -4.90977556e-01  7.66512193e-03
   9.03810486e-02 -8.63603279e-02 -5.06499447e-02 -6.37765080e-02
   1.24148671e-02  1.31530911e-02 -1.39562875e-01  3.83864194e-01
  -9.92045924e-02  2.19964445e-01 -9.64863747e-02  1.81051686e-01
   1.27893295e-02  2.99371481e-01 -1.88089550e-01 -5.04158497e-01
  -1.48730487e-01  3.92503470e-01 -1.10033490e-01  3.45170915e-01
   3.40092

In [11]:
type(message_embeddings)

numpy.ndarray

In [12]:
type(message_embeddings[0])

numpy.ndarray

In [13]:
type(tf.make_ndarray(tf.make_tensor_proto(message_embeddings)))

numpy.ndarray

In [14]:
a_np = tf.make_ndarray(tf.make_tensor_proto(message_embeddings))

## **Finding Cosine similarity**

**We encode every text pair in the dataset with the sentence-embedding model (`paraphrase-MiniLM-L6-v2` in the code below), obtaining one vector per text. For each pair of vectors, we then compute the cosine similarity using the standard cosine formula, that is, the cosine of the angle between the two vectors.**

$$\cos(\theta) = \frac{a \cdot b}{\lVert a \rVert \, \lVert b \rVert}$$
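**As a minimal sketch, the formula can be written with NumPy as follows. The function name `cosine` and the toy vectors are assumptions made for illustration. Note that the dot product is divided by the product of both norms.**

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """cos(theta) = dot(a, b) / (norm(a) * norm(b))."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    if denom == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return float(np.dot(a, b) / denom)

# Toy vectors chosen to show the three characteristic cases.
v = np.array([1.0, 2.0, 2.0])
print(cosine(v, 3.0 * v))                     # same direction: close to 1
print(cosine(v, np.array([2.0, -1.0, 0.0])))  # orthogonal: close to 0
print(cosine(v, -v))                          # opposite direction: close to -1
```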

**Cosine similarity values lie in the range -1 to 1. Since we need scores from 0 to 1, we add 1 to each cosine similarity (giving values from 0 to 2) and then divide by the largest resulting value, so that the final scores lie between 0 and 1.**

In [15]:
import pandas as pd
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from sentence_transformers import SentenceTransformer

# Initialize Sentence Transformer Model
model = SentenceTransformer('paraphrase-MiniLM-L6-v2')

# Load Text Data from DataFrame
texts1 = df['text1'].tolist()
texts2 = df['text2'].tolist()

# Generate Sentence Embeddings
embeddings1 = model.encode(texts1)
embeddings2 = model.encode(texts2)

# Compute Cosine Similarity Matrix
cosine_similarities = np.diag(cosine_similarity(embeddings1, embeddings2))

# Create DataFrame with Similarity Scores
Ans = pd.DataFrame({'Similarity_Score': cosine_similarities})

# Print Resulting DataFrame
print(Ans)

      Similarity_Score
0             0.073249
1             0.181460
2             0.223079
3            -0.003888
4             0.163516
...                ...
2995          0.086124
2996          0.265468
2997          0.064845
2998          0.069381
2999         -0.043127

[3000 rows x 1 columns]
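
**The cell above builds the full pairwise similarity matrix and keeps only its diagonal. As a hedged alternative, the sketch below computes the same per-pair scores row by row, without the full matrix. The helper `paired_cosine` and the random stand-in embeddings are assumptions for illustration, not model output.**

```python
import numpy as np

def paired_cosine(emb1: np.ndarray, emb2: np.ndarray) -> np.ndarray:
    """Cosine similarity between emb1[i] and emb2[i] for every row i."""
    dots = np.einsum("ij,ij->i", emb1, emb2)  # row-wise dot products
    norms = np.linalg.norm(emb1, axis=1) * np.linalg.norm(emb2, axis=1)
    return dots / norms

# Random stand-in embeddings (hypothetical data, not model output).
rng = np.random.default_rng(0)
emb1 = rng.normal(size=(5, 8))
emb2 = rng.normal(size=(5, 8))

# Reference: diagonal of the full pairwise cosine matrix.
unit1 = emb1 / np.linalg.norm(emb1, axis=1, keepdims=True)
unit2 = emb2 / np.linalg.norm(emb2, axis=1, keepdims=True)
full_diag = np.diag(unit1 @ unit2.T)

print(np.allclose(paired_cosine(emb1, emb2), full_diag))
```

**For N pairs, the row-wise version stores N scores instead of an N x N matrix, which matters as the dataset grows.**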


In [16]:
Ans.head()

| | Similarity_Score |
|---|---|
| 0 | 0.073249 |
| 1 | 0.18146 |
| 2 | 0.223079 |
| 3 | -0.003888 |
| 4 | 0.163516 |


In [17]:
#Join DataFrames based on index
df = df.join(Ans)

**The raw 'Similarity_Score' values are cosine similarities, which lie between -1 and 1. The next two cells add 1 to each score and then divide by the largest shifted score, which scales 'Similarity_Score' to a range between 0 and 1. When the largest raw score equals 1, this is the same as df['Similarity_Score'] = (df['Similarity_Score'] + 1) / 2.**
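
**The following sketch compares two ways of rescaling, using hypothetical raw scores rather than the dataset's values. The first is the shift-then-divide-by-maximum approach of the cells below, and the second is the fixed mapping (score + 1) / 2. The two agree only when the largest raw score is 1.**

```python
import numpy as np

# Hypothetical raw cosine scores, including the extremes -1 and 1.
raw = np.array([-1.0, -0.25, 0.0, 0.5, 1.0])

# Approach of the notebook cells: shift by 1, then divide by the largest shifted value.
shifted = raw + 1.0
max_scaled = shifted / np.abs(shifted).max()

# Fixed alternative: map [-1, 1] to [0, 1] independently of the data.
fixed_scaled = (raw + 1.0) / 2.0

print(max_scaled)
print(fixed_scaled)

# If the largest raw score is below 1, the two mappings differ.
raw_small = np.array([-0.2, 0.1, 0.6])
shifted_small = raw_small + 1.0
print(shifted_small / shifted_small.max())  # top score is stretched to 1
print(shifted_small / 2.0)                  # top score stays at about 0.8
```

**The fixed mapping has the advantage that a score does not depend on which other pairs happen to be in the batch, which may matter when a different test set is scored later.**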

In [18]:
# Shift cosine similarities from [-1, 1] to [0, 2]
df['Similarity_Score'] = df['Similarity_Score'] + 1

In [19]:
df.head()

| | text1 | text2 | Similarity_Score |
|---|---|---|---|
| 0 | broadband challenges tv viewing the number of ... | gardener wins double in glasgow britain s jaso... | 1.073249 |
| 1 | rap boss arrested over drug find rap mogul mar... | amnesty chief laments war failure the lack of ... | 1.18146 |
| 2 | player burn-out worries robinson england coach... | hanks greeted at wintry premiere hollywood sta... | 1.223079 |
| 3 | hearts of oak 3-2 cotonsport hearts of oak set... | redford s vision of sundance despite sporting ... | 0.996112 |
| 4 | sir paul rocks super bowl crowds sir paul mcca... | mauresmo opens with victory in la amelie maure... | 1.163516 |


In [20]:
# Divide by the largest shifted score so that all values fall in [0, 1]
df['Similarity_Score'] = df['Similarity_Score'] / df['Similarity_Score'].abs().max()

In [21]:
df.head()

| | text1 | text2 | Similarity_Score |
|---|---|---|---|
| 0 | broadband challenges tv viewing the number of ... | gardener wins double in glasgow britain s jaso... | 0.536624 |
| 1 | rap boss arrested over drug find rap mogul mar... | amnesty chief laments war failure the lack of ... | 0.59073 |
| 2 | player burn-out worries robinson england coach... | hanks greeted at wintry premiere hollywood sta... | 0.611539 |
| 3 | hearts of oak 3-2 cotonsport hearts of oak set... | redford s vision of sundance despite sporting ... | 0.498056 |
| 4 | sir paul rocks super bowl crowds sir paul mcca... | mauresmo opens with victory in la amelie maure... | 0.581758 |


## **Submission**

In [22]:
Submission = df[['Similarity_Score']]

In [23]:
Submission.head()

| | Similarity_Score |
|---|---|
| 0 | 0.536624 |
| 1 | 0.59073 |
| 2 | 0.611539 |
| 3 | 0.498056 |
| 4 | 0.581758 |


In [24]:
Submission.set_index("Similarity_Score", inplace = True)

In [26]:
predictions_df = pd.DataFrame(df)

In [27]:
submission_file_path = '/kaggle/working/submission.csv'
predictions_df.to_csv(submission_file_path, index=False)

In [28]:
print(predictions_df)

                                                  text1  \
0     broadband challenges tv viewing the number of ...   
1     rap boss arrested over drug find rap mogul mar...   
2     player burn-out worries robinson england coach...   
3     hearts of oak 3-2 cotonsport hearts of oak set...   
4     sir paul rocks super bowl crowds sir paul mcca...   
...                                                 ...   
2995  uk directors guild nominees named martin scors...   
2996  u2 to play at grammy awards show irish rock ba...   
2997  pountney handed ban and fine northampton coach...   
2998  belle named  best scottish band  belle & sebas...   
2999  criminal probe on citigroup deals traders at u...   

                                                  text2  Similarity_Score  
0     gardener wins double in glasgow britain s jaso...          0.536624  
1     amnesty chief laments war failure the lack of ...          0.590730  
2     hanks greeted at wintry premiere hollywood sta...        