In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

# EDA

## Train dataset

The purpose of this Exploratory Data Analysis (EDA) is first of all to get acquainted with the data used in the Kaggle challenge ["U.S. Patent Phrase to Phrase Matching"](https://www.kaggle.com/competitions/us-patent-phrase-to-phrase-matching/overview).

In [None]:
# Load the train dataset
train = pd.read_csv("/kaggle/input/us-patent-phrase-to-phrase-matching/train.csv")
print(train.shape)
train.head()

In [None]:
train.describe(include='object')

In [None]:
# The maximum width in characters of a column in the repr of a pandas data structure
pd.set_option('display.max_colwidth', None)

In [None]:
train['score'].unique()

The 'score' column contains only 5 different score values.

In [None]:
train.groupby('score').count()

## Score meanings according to [Data Description](https://www.kaggle.com/competitions/us-patent-phrase-to-phrase-matching/data?select=train.csv)

The scores are in the 0-1 range with increments of 0.25 with the following meanings:

    1.0 - Very close match. This is typically an exact match except possibly for differences in conjugation, quantity (e.g. singular vs. plural), and addition or removal of stopwords (e.g. “the”, “and”, “or”).
    0.75 - Close synonym, e.g. “mobile phone” vs. “cellphone”. This also includes abbreviations, e.g. "TCP" -> "transmission control protocol".
    0.5 - Synonyms which don’t have the same meaning (same function, same properties). This includes broad-narrow (hyponym) and narrow-broad (hypernym) matches.
    0.25 - Somewhat related, e.g. the two phrases are in the same high level domain but are not synonyms. This also includes antonyms.
    0.0 - Unrelated.


The counts above show that the 0.5 category ("Synonyms which don’t have the same meaning") contains more phrases than any other category.
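
To put numbers on this comparison, the share of each score can be computed next to the raw counts. The snippet below is only a sketch: `toy_train` is a hypothetical stand-in for the real `train` dataframe and `score_distribution` is a helper name introduced here for illustration.

```python
import pandas as pd

# Hypothetical toy rows standing in for train.csv; only 'score' matters here.
toy_train = pd.DataFrame({
    "target": ["t1", "t2", "t3", "t4", "t5", "t6"],
    "score": [0.5, 0.5, 0.25, 0.0, 0.75, 0.5],
})

def score_distribution(df: pd.DataFrame) -> pd.DataFrame:
    """Return the row count and the relative share of each score value."""
    counts = df["score"].value_counts().sort_index()
    shares = counts / counts.sum()
    return pd.DataFrame({"count": counts, "share": shares})

dist = score_distribution(toy_train)
print(dist)
print("Most frequent score:", dist["count"].idxmax())
```

Calling `score_distribution(train)` on the real data should give the same kind of table, assuming the 'score' column is loaded as floats.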

In [None]:
train.groupby('score').nunique()

In [None]:
train[train['anchor'] == 'abatement'].count()

There are more unique elements in "target", which is logical: a single anchor or context can be paired with many different targets.
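
The one-anchor-to-many-targets relation can be checked directly by counting distinct targets per anchor. This is a minimal sketch on hypothetical toy rows (the phrases and context codes below are made up for illustration, not taken from `train.csv`).

```python
import pandas as pd

# Hypothetical toy rows; the real 'train' dataframe has the same column names.
toy_train = pd.DataFrame({
    "anchor": ["abatement", "abatement", "abatement",
               "toy anchor", "toy anchor"],
    "target": ["target one", "target two", "target three",
               "target four", "target five"],
    "context": ["A47", "A47", "A47", "F25", "F25"],
})

# Number of distinct targets attached to each anchor.
targets_per_anchor = toy_train.groupby("anchor")["target"].nunique()
print(targets_per_anchor)

# Number of distinct values in each column, for comparison.
print(toy_train[["anchor", "target", "context"]].nunique())
```

On the real data, replacing `toy_train` with `train` should show how many targets each anchor is compared against.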

# Test Corpus

In [None]:
# Load the 'test' corpus
test = pd.read_csv("/kaggle/input/us-patent-phrase-to-phrase-matching/test.csv")

In [None]:
test.shape

The 'test' dataset contains only 36 rows.

# Additional Context

As we can see, the phrases in the 'train' dataset are very short. NLP tasks on short phrases are among the most challenging. We can find more context in another dataset, "Cooperative Patent Classification Codes Meaning" ("cpc-codes"), added by @xhlulu and available on Kaggle.

In [None]:
titles = pd.read_csv("../input/cpc-codes/titles.csv")
print(titles.shape)
titles.head()

The 'cpc-codes' dataset provides more context in its "title" column. We can use it to enrich the 'train' dataset. Let's look at this dataset in more detail.
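
One possible way to do the enrichment is a left merge. The sketch below assumes that the 'context' column of 'train' holds CPC codes matching the 'code' column of 'titles'. The toy rows and placeholder titles are hypothetical.

```python
import pandas as pd

# Hypothetical toy rows; "Z99" is deliberately missing from the lookup table.
toy_train = pd.DataFrame({
    "anchor": ["abatement", "abatement"],
    "target": ["target one", "target two"],
    "context": ["A47", "Z99"],
})
toy_titles = pd.DataFrame({
    "code": ["A47", "B01"],
    "title": ["placeholder title for A47", "placeholder title for B01"],
})

# A left merge keeps every train row, even when no title is found for its code.
enriched = toy_train.merge(
    toy_titles[["code", "title"]],
    left_on="context",
    right_on="code",
    how="left",
).drop(columns="code")

# Rows whose context code has no title end up with NaN in 'title'.
missing = int(enriched["title"].isna().sum())
print(enriched)
print("Rows without a title:", missing)
```

Counting the rows without a title is a quick way to see whether the assumed match between 'context' and 'code' holds on the real data.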

In [None]:
titles[titles['section'] == 'A']['code'].unique()

One section can contain sub-sections and sub-sub-sections.
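
This nesting can be read from the code string itself. In the CPC scheme a code starts with a section letter, followed by a two-digit class, a subclass letter and then a group. The helper below, `cpc_levels`, is a hypothetical name introduced here, and it assumes that the 'code' column stores codes in the compact form "A01B1/00".

```python
def cpc_levels(code: str) -> dict:
    """Split a CPC code into its hierarchy levels, based on prefix length."""
    code = code.strip()
    levels = {"section": code[:1]}      # e.g. "A"
    if len(code) >= 3:
        levels["class"] = code[:3]      # e.g. "A01"
    if len(code) >= 4:
        levels["subclass"] = code[:4]   # e.g. "A01B"
    if len(code) > 4:
        levels["group"] = code          # e.g. "A01B1/00"
    return levels

for example in ["A", "A01", "A01B", "A01B1/00"]:
    print(example, "->", cpc_levels(example))
```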

The 'pandas-profiling' library produces a fast EDA with a nice layout. However, one should be careful when interpreting the results: depending on the dataset, some parts of the analysis can be meaningless.

In [None]:
!pip install pandas-profiling

In [None]:
import pandas_profiling
# Generate the pandas-profiling report
report = pandas_profiling.ProfileReport(titles)

In [None]:
report

# Wordcloud

A wordcloud lets us perceive at a glance the main semantic trends in the dataset. The words that appear largest in the cloud are those that occur most often in the dataset.
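
The idea behind the word sizes can be approximated with a plain word count. This is only a sketch on hypothetical phrases. By default `WordCloud` also removes stopwords and can group frequent word pairs, so its sizing will not match raw counts exactly.

```python
from collections import Counter

# Hypothetical short phrases standing in for the 'target' column.
toy_targets = ["control device", "display device", "control system", "device"]

def word_counts(phrases):
    """Count lowercase, whitespace-separated words over all phrases."""
    counter = Counter()
    for phrase in phrases:
        counter.update(phrase.lower().split())
    return counter

counts = word_counts(toy_targets)
# The most frequent words are the ones a wordcloud would draw largest.
print(counts.most_common(2))
```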

In [None]:
from wordcloud import WordCloud
import matplotlib.pyplot as plt

In [None]:
cloud_txt = ' '.join(train['target'].values.tolist())

In [None]:
wc = WordCloud()

In [None]:
wc.generate(cloud_txt)

In [None]:
wc.to_file('output.png')

In [None]:
%matplotlib inline
plt.imshow(wc, interpolation='bilinear')
plt.axis("off")
plt.show()

Based on the wordcloud, we can suppose that the 'train' dataset contains phrases describing different devices, systems and materials. Let's make a wordcloud for the 'cpc-codes' dataset.

In [None]:
# Drop missing titles (if any) so that the join only sees strings
cloud_titles = ' '.join(titles['title'].dropna().astype(str).tolist())

As the 'cpc-codes' dataset is bigger than the 'train' one, let's make the wordcloud image bigger.

In [None]:
# Build the wordcloud from the CPC titles text (cloud_titles), not from the train targets
wordcloud = WordCloud(width=800,
                      height=800,
                      background_color='white',
                      min_font_size=10,
                      ).generate(cloud_titles)

# Plot the WordCloud image
plt.figure(figsize=(8, 8), facecolor=None)
plt.title("- Most Common Words within CPC Code Titles -",
          size=22, weight="bold")
plt.imshow(wordcloud)
plt.axis("off")
plt.tight_layout(pad=0)

plt.show()

The general topic remains the same. The patents give technical descriptions of different devices, systems and materials. The patent code descriptions add precision and detail.

The EDA step is finished. The next step is training and testing different models.