<center style="color:red;"><h1>US patent phrase to phrase matching</h1></center>

<h4 style="color:blue;font-weight:bold;font-size:14;">In this dataset, you are presented with pairs of phrases (an anchor and a target phrase) and asked to rate how similar they are on a scale from 0 (not at all similar) to 1 (identical in meaning).</h4>

<h2 style="color:red;">Files</h2>
<h4 style="color:blue;">train.csv - the training set, containing phrases, contexts, and their similarity scores<br>
test.csv - the test set, identical in structure to the training set but without the score<br>
sample_submission.csv - a sample submission file in the correct format</h4>

<h2 style="color:red">Columns</h2>
<h4 style="color:blue">id - a unique identifier for a pair of phrases<br>
anchor - the first phrase<br>
target - the second phrase<br>
context - the CPC classification (version 2021.05), which indicates the subject within which the similarity is to be scored<br>
score - the similarity. This is sourced from a combination of one or more manual expert ratings.</h4>
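
Since the context column holds a CPC code, it can be handy to break it into coarser parts before grouping or plotting. The sketch below assumes each code is one section letter followed by a two-digit class (for example "A47"); that layout is an assumption worth checking against the real data. The helper name `split_cpc_context` and the sample values are hypothetical.

```python
import pandas as pd

def split_cpc_context(context: pd.Series) -> pd.DataFrame:
    """Split codes such as 'A47' into a section letter and the remaining class digits.

    Assumes one leading letter followed by the class digits (not verified here).
    """
    return pd.DataFrame({
        "cpc_section": context.str[0],   # first character, e.g. 'A'
        "cpc_class": context.str[1:],    # everything after it, e.g. '47'
    })

# Hypothetical sample values standing in for train['context']
sample_context = pd.Series(["A47", "B01", "H04"])
parts = split_cpc_context(sample_context)
print(parts)
```

On the real data this would be called as `split_cpc_context(train['context'])`.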

<h2 style="color:red;">Score Meanings</h2>
<h4 style="color:blue;">The scores are in the 0-1 range with increments of 0.25 with the following meanings:<br>

1.0 - Very close match. This is typically an exact match except possibly for differences in conjugation, quantity (e.g. singular vs. plural), and addition or removal of stopwords (e.g. “the”, “and”, “or”).<br>
0.75 - Close synonym, e.g. “mobile phone” vs. “cellphone”. This also includes abbreviations, e.g. "TCP" -> "transmission control protocol".<br>
0.5 - Synonyms which don’t have the same meaning (same function, same properties). This includes broad-narrow (hyponym) and narrow-broad (hypernym) matches.<br>
0.25 - Somewhat related, e.g. the two phrases are in the same high level domain but are not synonyms. This also includes antonyms.<br>
0.0 - Unrelated.</h4>
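
Because the scores are meant to sit on a 0.25 grid, a quick sanity check can confirm that no other values slipped in. This is a minimal sketch; `VALID_SCORES`, `invalid_scores`, and the sample series are hypothetical names and data used only for illustration.

```python
import pandas as pd

# The five score levels described above
VALID_SCORES = {0.0, 0.25, 0.5, 0.75, 1.0}

def invalid_scores(scores: pd.Series) -> pd.Series:
    """Return the entries that are not one of the five allowed score levels."""
    return scores[~scores.isin(VALID_SCORES)]

# Hypothetical sample standing in for train['score']; 0.6 is deliberately off-grid
sample_scores = pd.Series([0.0, 0.25, 0.5, 0.6, 1.0])
bad = invalid_scores(sample_scores)
print(bad)
```

An empty result from `invalid_scores(train['score'])` would mean every score is on the expected grid.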

In [None]:
import pandas as pd
import matplotlib.pyplot as plt

In [None]:
train=pd.read_csv("../input/us-patent-phrase-to-phrase-matching/train.csv")
train.head()

In [None]:
test=pd.read_csv(r"../input/us-patent-phrase-to-phrase-matching/test.csv")
test.head()

In [None]:
submission=pd.read_csv("../input/us-patent-phrase-to-phrase-matching/sample_submission.csv")
submission.head()

In [None]:
train.shape,test.shape,submission.shape

<h1 style="color:red">Check for missing values in the training dataset</h1>

In [None]:
import missingno as msno
msno.bar(train,sort='ascending',color='#7209b7',figsize=(14,5),fontsize=14)

<h3 style="color:blue;font-weight:bold;">No missing values found in the training dataset</h3>
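
The bar chart gives a visual impression; a per-column count makes the same check explicit. The sketch below uses a small hypothetical frame with one deliberately missing value so the output is non-trivial; `missing_summary` is an illustrative helper, not part of any library.

```python
import pandas as pd

def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Count missing values per column and express them as a share of all rows."""
    counts = df.isna().sum()
    return pd.DataFrame({"missing": counts, "share": counts / len(df)})

# Hypothetical two-row sample; the real call would be missing_summary(train)
sample = pd.DataFrame({
    "anchor": ["abatement", None],
    "target": ["act of abating", "forest region"],
})
summary = missing_summary(sample)
print(summary)
```

For the training set, every entry in the `missing` column should be zero if the chart was read correctly.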

In [None]:
(train['anchor'].value_counts()[:20]).plot.bar(figsize=(14,6))
plt.xticks(fontsize=14)
plt.yticks(fontsize=14)
plt.show()

<h3 style="color:blue;font-weight:bold;">The twenty most frequent phrases in the anchor column</h3>

In [None]:
(train['target'].value_counts()[:20]).plot.bar(figsize=(14,6))
plt.xticks(fontsize=14)
plt.yticks(fontsize=14)
plt.show()

<h3 style="color:blue;font-weight:bold;">The twenty most frequent phrases in the target column</h3>

In [None]:
(train['context'].value_counts()[:20]).plot.bar(figsize=(14,6))
plt.xticks(fontsize=14)
plt.yticks(fontsize=14)
plt.show()

<h3 style="color:blue;font-weight:bold;">The twenty most frequent values in the context column</h3>

In [None]:
fig, ax  = plt.subplots(figsize=(16, 10))
#explode = (0.05, 0.05, 0.05, 0.05,0.05)
labels = list(train.score.value_counts().index)
sizes = train.score.value_counts().values
ax.pie(sizes,startangle=60, labels=labels,autopct='%1.0f%%', pctdistance=0.7,textprops={"size":16,"color":"lime"})
ax.add_artist(plt.Circle((0,0),0.4,fc='white'))
plt.show()

<h3 style="color:blue;font-weight:bold">0.0 accounts for 20% of the score column<br>
0.25 accounts for 32% of the score column<br>
0.5 accounts for 34% of the score column<br>
0.75 accounts for 11% of the score column<br>
1.0 accounts for 3% of the score column<br></h3>
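
The percentages read off the pie chart can also be computed directly, which avoids rounding by eye. A minimal sketch, using a hypothetical ten-value sample in place of `train['score']`; `score_shares` is an illustrative helper name.

```python
import pandas as pd

def score_shares(scores: pd.Series) -> pd.Series:
    """Percentage of rows at each score level, ordered by score."""
    return scores.value_counts(normalize=True).sort_index().mul(100).round(1)

# Hypothetical sample; the real call would be score_shares(train['score'])
sample_scores = pd.Series([0.0, 0.25, 0.25, 0.5, 0.5, 0.5, 0.75, 1.0, 0.5, 0.25])
shares = score_shares(sample_scores)
print(shares)
```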

In [None]:
import numpy as np
import seaborn as sns

# Number of whitespace-separated words in each anchor phrase
words=np.array(train['anchor'].str.split().apply(len))
words_list=list(words)

# Density estimate of the per-phrase word counts
plt.figure(figsize=(10,6))
sns.kdeplot(words_list,color="green")
plt.xlabel("Number of words",fontsize=14)
plt.xticks(fontsize=14)
plt.yticks(fontsize=14)
plt.show()


<h4 style="color:blue;font-weight:bold;">The graph above shows the distribution of the number of words per phrase in the anchor column. The density peaks at 2, so two-word phrases are the most common.</h4>

In [None]:
# Number of whitespace-separated words in each target phrase
words=np.array(train['target'].str.split().apply(len))
words_list=list(words)

# Density estimate of the per-phrase word counts
plt.figure(figsize=(10,6))
sns.kdeplot(words_list,color="green")
plt.xlabel("Number of words",fontsize=14)
plt.xticks(fontsize=14)
plt.yticks(fontsize=14)
plt.show()

<h4 style="color:blue;font-weight:bold;">Similarly, two-word phrases are the most common in the target column.</h4>
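
A KDE smooths integer word counts into a continuous curve, so an exact tally can be a useful cross-check on where the peak really is. The sketch below is one way to do that; `word_count_table` and the sample phrases are hypothetical.

```python
import pandas as pd

def word_count_table(phrases: pd.Series) -> pd.Series:
    """Exact number of phrases for each word count, ordered by word count."""
    return phrases.str.split().str.len().value_counts().sort_index()

# Hypothetical sample; the real call would use train['anchor'] or train['target']
sample_phrases = pd.Series([
    "abatement",
    "mobile phone",
    "transmission control protocol",
    "cell phone",
])
table = word_count_table(sample_phrases)
print(table)
```

The index of the result is the word count and the values are how many phrases have that length.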

In [None]:
train.columns

In [None]:
# Import the word cloud utilities
from wordcloud import WordCloud, STOPWORDS

stopwords = set(STOPWORDS)

# Lowercase every phrase in the anchor column and join them into one string
comment_words = " ".join(train.anchor.astype(str).str.lower())

wordcloud = WordCloud(width = 800, height = 800,background_color ='white',stopwords = stopwords,min_font_size = 10).generate(comment_words)

# Plot the word cloud image
plt.figure(figsize = (8, 8), facecolor = None)
plt.imshow(wordcloud)
plt.axis("off")
plt.tight_layout(pad = 0)

plt.show()

In [None]:
stopwords = set(STOPWORDS)

# Lowercase every phrase in the target column and join them into one string
comment_words = " ".join(train.target.astype(str).str.lower())

wordcloud = WordCloud(width = 800, height = 800,background_color ='white',stopwords = stopwords,min_font_size = 10).generate(comment_words)

# Plot the word cloud image
plt.figure(figsize = (8, 8), facecolor = None)
plt.imshow(wordcloud)
plt.axis("off")
plt.tight_layout(pad = 0)

plt.show()
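
Word clouds are hard to read precisely, so a plain frequency table of the same tokens can complement them. This is a rough sketch using `collections.Counter` rather than the word cloud library's own tokenizer, so the counts may not match the cloud exactly. `top_words`, the sample phrases, and the one-word stopword set are hypothetical.

```python
from collections import Counter
import pandas as pd

def top_words(phrases: pd.Series, stopwords=frozenset(), k: int = 10):
    """Return the k most frequent lowercase tokens, skipping the given stopwords."""
    tokens = (
        token
        for phrase in phrases.astype(str)
        for token in phrase.lower().split()
        if token not in stopwords
    )
    return Counter(tokens).most_common(k)

# Hypothetical sample; the real call might be top_words(train['target'], set(STOPWORDS), 20)
sample = pd.Series(["Mobile phone", "cell phone", "the phone case"])
top = top_words(sample, stopwords={"the"}, k=2)
print(top)
```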